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<N . Abstract 

^ , Background Empirical substitution matrices represent the average tendencies of substitutions over 

' various protein families by sacrificing gene-level resolution. We develop a codon-based model, in which 

mutational tendencies of codon, a genetic code, and the strength of selective constraints against amino 
acid replacements can be tailored to a given gene. First, selective constraints averaged over proteins are 
estimated by maximizing the likelihood of each 1-PAM matrix of empirical amino acid (JTT, WAG, and 
LG) and codon (KHG) substitution matrices. Then, selective constraints specific to given proteins are 
approximated as a linear function of those estimated from the empirical substitution matrices. 
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^ O . Results Akaike information criterion (AIC) values indicate that a model allowing multiple nucleotide 

I changes fits the empirical substitution matrices significantly better. Also, the ML estimates of transition- 

transversion bias obtained from these empirical matrices are not so large as previously estimated. The 
selective constraints are characteristic of proteins rather than species. However, their relative strengths 
among amino acid pairs can be approximated not to depend very much on protein families but amino acid 
pairs, because the present model, in which selective constraints are approximated to be a linear function 
of those estimated from the JTT/WAG/LG/KHG matrices, can provide a good fit to other empirical 
I substitution matrices including cpREV for chloroplast proteins and mtREV for vertebrate mitochondrial 

. proteins. 
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^nJ" ' Conclusions/Significance The present codon-based model with the ML estimates of selective con- 

, straints and with adjustable mutation rates of nucleotide would be useful as a simple substitution model 

in ML and Bayesian inferences of molecular phylogenetic trees, and enables us to obtain biologically 
meaningful information at both nucleotide and amino acid levels from codon and protein sequences. 
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Introduction 

Any method for inferring molecular phylogeny is implicitly or explicitly based on the evolutionary mech- 
anism of nucleotide or amino acid substitutions, and the reliability of phylogenetic analyses strongly 
depends on models assumed for the substitution processes of nucleotide and amino acid. Mutational 
events occur at the individual nucleotide level, but selective pressure primarily operates at the amino 
acid level. Thus, a codon-based model of amino acid substitutions has a potential to be preferable to 
both mononucleotide substitution models [IH3] and amino acid substitution models [iHl2j. because it 
can take into account both mutational tendencies at the nucleotide level and selective pressure on amino 
acid replacements as well as the knowledge of a genetic code. Schneider et al. [13] and Kosiol et al. [I4] 
empirically estimated a codon substitution matrix from a large number of coding sequence alignments. 
However, the tendencies of substitutions differ among nuclear, mitochondrial [6], and chloroplast genes 
[8]. Delport et al. [T5j[l6] pointed out that empirical substitution matrices represent the average ten- 
dencies of substitutions over various protein families by sacrificing gene-level resolution. A mechanistic 
codon substitution model, in which one can change a genetic code, and adjust mutational tendencies at 
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the codon level and selectional preferences on amino acid replacements, is potentially more superior than 
empirical codon substitution matrices. 

A main difference between the current mechanistic codon substitution models [Tl llSfB^ resides in 
the estimation of selective constraints against amino acid replacements. (1) In [TOll^l^ . the difference 
between nonsynonymous and synonymous substitution rates was taken into account but the amino acid 
dependences of selective constraints were not taken into account; i.e., single selective constraints. (2) In 
[7lll7[[T8]. selective constraints against amino acid replacements were evaluated from physico-chemical 
properties of amino acids. (3) In |2 HI23l[24] . codon exchangeabilities for nonsynonymous changes were 
evaluated from those in empirical amino acid substitution matrices. (4) In |15lll6j . selective constraints 
were grouped, and the number of groups and the strength of selective constraint of each group were 
optimized for a given protein phylogeny. The fourth method has the highest resolution of selective 
constraints employing as many substitution groups as necessary. However, it seems to be a very computer- 
intensive calculation [TB]. Here, we try to estimate selective constraint for each type of amino acid 
replacement by maximizing the likelihood of individual empirical substitution matrices. Unlike the present 
method, in the previous methods of this third category codon exchangeabilities for nonsynonymous 
changes were assumed to be proportional to the corresponding amino acid exchangeability [23] , or a codon 
substitution matrix was restricted to yield amino acid exchangeabilities equal to empirically-derived ones 
j21j . The empirical substitution matrices fitted are 1-PAM amino acid substitution frequency matrices, 
the JTT matrix [5] , the WAG matrix [TO] , and the LG matrix [TT| , evaluated from relatively large data of 
nuclear-encoded proteins, the mtREV matrix [B] from vertebrate mitochondrial proteins, and the cpREV 
matrix ^ from chloroplast-encoded proteins, and also a 1-PAM codon substitution frequency matrix 
(KHG) [14]. In the following, these empirical substitution frequency matrices corresponding to 1 PAM 
will be simply referred to by their common acronyms, JTT, WAG, LG, KHG, mtREV, and cpREV. 

In most of the reversible Markov models for codon substitutions, instantaneous rates for codon substi- 
tutions that require multiple nucleotide changes were assumed to be equal to 0. [rSlfTTHT^ . However, in 
all empirical substitution matrices unnegligible amounts of rates are assigned to amino acid replacements 
that require multiple nucleotide changes. Variations in substitution rates or time intervals would yield 
significant amounts of probabilities for the multi-step substitutions. Alternative explanation is that the 
significant fraction of these substitutions occurred with multiple nucleotide changes. Thus, both of them 
are taken into account in the present work. It is assumed that substitution rates are distributed with a 
F distribution. The use of F distribution for rate variation has been attempted in many studies |25[|26| . 
Multiple nucleotide changes are assumed to occur in the same order of time as single nucleotide changes 
do. 

Interdependence of nucleotide substitutions at three codon positions [7] and also spanning codon 
boundaries [20] have been pointed out. Evidences for a high frequency, which is the order of 0.1 per site 
per billion years, of double- nucleotide substitutions were found in diverse organisms by Averof et al. |27| . 
although there is a report [28] indicating a low rate of double-nucleotide mutations in primates. Bazykin 
et al. }29j pointed out a possibility of successive single compensatory substitutions for multiple nucleotide 
changes. Recently, many codon models relaxing mathematical assumptions in a more sophisticated way 
than the models of Goldman and Yang [18] and Muse and Gaut [19] are devised to study and to detect 
evidence of positive selection in codon evolutionary processes; see Anisimova and Kosiol j30j for a review. 

In the Singlet-Doublet-Triplet (SDT) mutation model [20] , single-nucleotide, doublet and triplet mu- 
tations spanning codon boundaries are taken into account, but double nucleotide mutations at the first 
and the third positions in a codon were not taken into account. The dependences of selective constraints 
on amino acid pairs were not taken into account. In the present model, it is assumed that nucleotide 
mutations occur independently at each codon position and so any double nucleotide mutation occurs as 
frequently as doublet mutations. The codon substitution rate matrix of KHG [13] indicates that some 
types of double nucleotide mutations at the first and the third positions frequently occur. 

Close relationships between selective constraints on amino acids and physico-chemical properties of 
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amino acids and protein structures have been pointed out pil^ [T7ll5Tk34| . We suppose that the relative 
strengths of selective constraints among amino acid pairs do not strongly depend on species, organelles, 
and even protein families but amino acid pairs. Then, we examine the performance of the present 
codon-based model, in which selective constraints are approximated to be a linear function of those 
estimated from JTT, WAG, LG, or KHG, in respect of how well other empirical substitution matrices 
including cpREV and mtREV can be fitted by adjusting parameters such as mutational tendencies and 
the strength of selective constraints. It is shown that these maximum likelihood (ML) estimators of 
the selective constraints perform better than any physico-chemical estimation. It is also indicated that 
the present model yields good values of Akaike information criterion (AIC) for a phylogenetic tree of 
mitochondrial coding sequences in comparison with the codon model almost equivalent to mtREV. If the 
present model is applied to the ML inference of phylogenetic trees, it will allow us to estimate mutational 
tendencies at the nucleotide level, which are specific to each species and organelle, such as transition- 
transversion bias and the ratio of nonsynonymous to synonymous rate. One of the interesting results 
revealed by the present model is that the ML estimators of transition to transversion bias calculated 
from the empirical substitution matrices are not so large as previously estimated. Also, AIC values 
indicate that a model allowing multiple nucleotide changes fits the empirical substitution matrices and 
the phylogeny of vertebrate mitochondrial proteins significantly better. 

The present codon-based model with the new estimates for selective constraints on amino acids is 
useful as a simple evolutionary model for phylogenetic estimation, and also useful to generate log-odds 
for codon substitutions in protein-coding sequences with any genetic code. 

Methods 



A mechanistic codon substitution model with multiple nucleotide changes 

In early codon substitution models [171I18| . the probabilities of multiple nucleotide replacements in the 
infinitesimal time difference At were completely neglected by assuming them to be O(At^), when the 
probabilities of single nucleotide replacements are taken to be 0{At). In other words, the instantaneous 
mutation rate M^y from codon jiio v was assumed to be equal to zero for codon pairs requiring multiple 
nucleotide replacements. However, multiple nucleotide mutations may not be neglected in real protein 
evolution [7l[T4l[20l[27l[29l|35]. Here, multiple nucleotide changes are assumed to occur with the same 
order of time as single nucleotide changes occur, but unlike the SDT model [20] a mutation process is 
simplified in such a way that mutations independently occur at each position of a codon. Thus, the 
mutation rate matrix for a codon is defined here as 

3 
i=l 

where Bi is a mutation rate matrix between the four types of nucleotides at the ith codon position, 
5^iVi is the Kronecker's (5, and the index means the ith nucleotide in the codon fi; fj, = {fJ.i, fi2, fJ-s) 
where /ii 6 { a, t, c, g } . Assuming that the rate matrix Bi satisfies the detailed balance condition, it is 
represented as 

{Bi)fiiUi 

rmut 



{m^)^.^.Jf::: for * = 1,2, 3 (2) 
{mi)u.fj,. (3) 

rmut ymut xmut ( a\ 
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where /,-™* is the equiUbrium composition of nucleotide Vi at the it\\ codon position, and {mi)^.i,. is 
the exchangeabihty between nucleotides /i^ and Vi at the ith codon position. As a result of the detailed 
balance condition assumed for the Bi, the M also satisfies the detailed balance condition; 

/-"'Af^, = fr'M,^ (5) 

The instantaneous substitution rate R^^i, from codon fi to v can be represented as the product of 
the mutation rate A/^j, and the average rate of fixation Ffj^^, , which is defined to be the average fixation 
probability multiplied by the chromosomal population size, for mutations from codon ^ to v under 
selection pressure; oc M^^F^^, for ^ ^ v. Let us assume that the R also satisfies the detailed 
balance condition; that is, 

ffiRfiv ~ URvfi (6) 

where is the equilibrium codon composition of the substitution rate matrix R. The detailed balance 
condition Eq. [6] for the R is equivalent with a condition that R^i, can be expressed to be a product of the 
(/i, u) element of a symmetric matrix and the equilibrium composition f^. Similarly, the detailed balance 
condition Eq. [5] for the M is equivalent with a condition that the matrix whose (^, v) element is equal 
to M^i^//™"* is symmetric. Thus, the detailed balance conditions for the M and the R require that the 
average fixation rate F^^ must be represented as the product of the two terms, fu/ f'J^^^ and e'^''" , where 
w^i/ = Wj,^; F^„ = (/iy//™"*)e™^'' for ^ v- Then, the codon substitution rate i?^^ can be represented 
as 



= Const AV-^^e'"- ior^i^v (7) 



where Const is an arbitrary scaling constant. By taking the frequencies of stop codons to be zero, the 
probability fiow from any codon to a termination codon and its inverse flow are set to zero. The unit of 
time is chosen by determining the arbitrary scaling constant Const in Eq. [7] in such a way that the total 
rate of the rate matrix R is equal to one; 



-E/m^!/^^ = 1 (8) 



Therefore, only the relative values among M^^ are meaningful. The frequency-dependent term fi^/ f™^^ 
represents the effects of selection pressures at the DNA level as well as at the amino acid level, which 
change the codon frequency from the mutational equilibrium frequency /™"* to the frequency fi, speciflc 
to a gene. The fixation rate is obviously equal to for lethal mutations and 1 for neutral mutations. Here, 
we approximate the average quantity e"''"' over mutants to be independent of codon frequencies. The 
quantity e'"f"' is the same as the one called the rate of acceptance by Miyata et al. [35] ■ We assume that 
selection pressure against codon replacements principally appears on an amino acid sequence encoded 
by a nucleotide sequence; Wf^i, for the codon pair v) is equal to the selective constraint Wab for the 
encoded amino acid pair (a, b). 

Y.aHbe{ amino acids } C^aC^be'""'' 

for fijiy ^ { stop codons } and [i^v (9) 
for [I ox V ^ \ stop codons } and H ^ v 

where C^a is a genetic code table and takes the value one if codon /i encodes amino acid a, otherwise zero. 
At the amino acid level, there should be no selection pressure against synonymous mutations. Thus, the 
Wab satisfies 



Wab = VJba , Waa = 



(10) 
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The matrix w will be directly estimated by maximizing the likelihood of an empirical substitution 
matrix, or it will be evaluated for a specific protein family as a linear function of such an estimate of Wab, 

Wab = M"+U;o(l-'5,b) (11) 

In Eq. [TI] Sab is the Kronecker's S, and w^f™^'^^° means the estimate of Wab, which is either a physico- 
chemical estimate or a ML estimate calculated from a specific substitution matrix, and satisfies Eq. 1101 
The parameter /3, which is non-negative, adjusts the strength of selective constraints for a protein family. 
The parameter wq controls the ratio of nonsynonymous to synonymous substitution rate, but it will be 
ineffective and may be assumed to be equal to if amino acid sequences rather than codon sequences are 
analyzed. 

Then, the substitution probability matrix S{t) at time t in a time-homogeneous Markov process can 
be calculated as 

S{t) = exp(ffi) (12) 

Because the rate matrix R satisfies the detailed balance condition, the S{t) also satisfies it. Therefore, 
a substitution process is modeled as a reversible Markov process. The S{t) and the R that satisfy 
the detailed balance condition can be easily diagonalized with real eigenvalues and eigenvectors |17| : 
the eigenvalues of R are the same as those of a symmetric matrix whose (/i, v) element is equal to 

If multiple nucleotide changes were completely ignored, then Eq. [1] would be simplified as M^^ = 

((1 ^ ^l^ll^l){Bl)fJ,il^iSfj_2iy2^IJ.3iy3) + (1 ^ '5ai21'2)(^2)ai21'2'^A'31^3) + ('^Ml I'l '^A'2 1^2 ( ^ ^ '^MSl^a ) ("^3 )/i3 ) ' wfiOSB 

formulation for a codon mutation rate matrix with Eq. [2] is essentially the same as the one proposed by 
Muse and Gault [12]. Here, it should be noted that {Bi)^.i,. in Eq. [5] is defined to be proportional to 
the equilibrium nucleotide composition /™^*. Alternatively, one may define M^j, as A/^i, = ni=i[^Mii^i + 
(1 — ^ iJ.iUi){rni) fj_.v-]f™^^ in the same way as Miyazawa and Jernigan [T^ and others [71IT5] defined it to 
be proportional explicitly to the composition of the base triplet, /™"*. This alternative definition with 
Eqs. [7] and [5] is equivalent to Eqs. [T] and [2] with /™"* = 0.25, and thus it is a special case in the present 
formulation; see [36] for justifications of this alternative definition. 

In the present analyses, we assume for simplicity that {mi)fj_-^. and /j™"' do not depend on codon 
position i; that is, (mi)^ri = m^ri and /"™* where 77 G {a, t, c, g}. This assumption is reasonable 

because mutational tendencies may not depend on a nucleotide position in a codon. Let us define TO[tc][ac/] 
to represent the cLV6rcLgG of tliG GxciiangG abilities of the trcinsversion type, Ttiid^ '^tgi ^^cai 

and rucg, and 

likewise mtc\ag to represent the average of the exchangeabilities of the transition type, rritc and niag- We 
use the ratios {m^ri / fn^tc\[ag\} as parameters for exchangeabilities, and ra^tc\[ag\ to represent the ratio of 
the exchangeability of double nucleotide change to that of single nucleotide change and also the ratio 
of the exchangeability of triple nucleotide change to that of double nucleotide change; note that the 
exchangeabilities of single, double, and triple nucleotide changes are of 0(m[jc][ag]), O(to^j^jj^^j), and 
^("^ftc][a£(]) n respectively, and that Eq. |8]must be satisfied. Then, multiple nucleotide changes 

in a codon can be completely neglected by making the parameter 77i[tc][ag] approach zero with keeping 
{m^,~i / m^tc\[ag]\ constant in Eq. [U Also, it is noted that double nucleotide changes at the first and the 
third positions in a codon are assumed to occur as frequently as doublet changes. 



Empirical substitution matrices used for model fitting 

Maximum likelihood (ML) values are calculated for each 1-PAM substitution frequency matrix, which 
corresponds to the time duration of 1 amino acid substitution per 100 amino acids, of the JTT [S], the 
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WAG [TU], the LG [TT], the cpREV [5], and the mtREV amino acid substitution matrices, and of 
the KHG codon substitution matrix [T3]. We have arbitrarily chosen the transition matrices of 1-PAM, 
whose time interval is long enough for the significant number of substitutions to occur and also too short 
for multi-step substitutions to cover multiple nucleotide changes. JTT is an accepted point mutation 
matrix compiled from the pairs of closely related proteins encoded in nuclear DNA. WAG, LG, cpREV, 
and mtREV are amino acid substitution matrices estimated by maximizing the likelihood of a given set 
of optimum phylogenetic trees. The KHG matrix used is the one named EGMunrest in the supplement 
of their paper, for which multiple nucleotide changes are allowed. JTT, WAG, LG, and KHG were all 
calculated from nuclear-encoded proteins, although JTT was calculated by a different method from the 
others. The matrices of cpREV and mtREV were calculated from proteins encoded in chloroplast DNA, 
and in vertebrate mitochondrial DNA, respectively. It should be noted here that a non-universal genetic 
code is used in the mitochondrial DNA. 



Average of a transition matrix over time or over rate 

In the present study, model parameters are estimated by maximizing the likelihood of each 1-PAM 
substitution frequency matrix of JTT, WAG, LG, cpREV, mtREV, and KHG. In the case of JTT, the 
pairs of closely related sequences were used to count substitutions and the transition matrix was calculated 
by completely neglecting multiple substitutions at a site in a parsimony method. Thus, JTT should be 
considered to consist of substitutions that occurred in various time intervals (various branch lengths) . The 
substitution rate matrices of WAG, LG, mtREV, cpREV and KHG were estimated by the ML method for 
a given set of protein phylogenetic trees. Each site of protein families may have evolved with a different 
rate. As a result, these substitution matrices may be regarded as an average over different substitution 
rates. Here we assume that evolutionary time intervals or substitution rates for each substitution matrix 
are distributed in a F distribution. There have been many attempts [251126] of using a F distribution for 
rate variation. 

If the substitution rate matrix R is assumed to vary only by a scalar factor, the mean of a substitution 
matrix irrespective of over-time and over-rate will be calculated as 

POO 

{S){T,a) = / S{t)r{t;T,<j)dt 
Jo 

exp{-(/ - ai?)-}(-r-i- = [(/ - aR)-r (13) 
(7 a a 

where F(t; r, cr) is the probability density function of a F distribution with a scale parameter a and a 
shape parameter r, F(t) is the F function, and / is the identity matrix. The mean and the variance of 
the F distribution F(t;r, cr) are equal to rcr and tct^, respectively. Here we should recall that the rate 
matrix R is normalized such that the total rate per unit time is equal to one; see Eq. [51 
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Evaluation of the log-likelihood of an empirical substitution matrix 

The log- likelihood of the empirical frequency, A^a = -^1°^^^°^^^ of substitutions from k to A in the 
present model can be calculated as 

m = iV^^/f^5°^^log(/,(5)(r,a),A) (W) 

K A 

where k and A mean one of the amino acid types for amino acid substitution matrices or one of the codon 
types for codon substitution matrices, 5'°'"^ is an observed transition probability matrix corresponding 
to the accepted point mutation matrix A, f°^^^ is the observed composition of amino acid or codon k, 
and N is the total number of amino acid or codon sites compared to count substitutions. The observed 
composition f°^^ is assumed to be the equilibrium composition of 5'°'^''. is a set of parameters and 
9 = argmaxg i{6) is a set of the maximum likelihood (ML) estimators. Similarly, the estimate /kl 
of the KuUback-Leibler (K-L) information by replacing the real distribution to the observed frequency 
distribution is calculated as 

/kl(0) 

= E E /«'^^°A^[log(/°'^^°A^) - ^og{US){T, a),,)] (15) 

K A 

= -^(0)/iV + ^^/f^5°^^log(/f^^°^^) (16) 

K A 

Maximum log-likelihood £{9) corresponds to the mininmm of the estimate of K-L information, /kl(^). 

The transition probability, S{t)ab, between amino acids a and b and the composition, fa, of amino 
acid a are related to those for codons as follows. 

faS{t)ab = EE^'^--^'^^(*)'^-^-'' (1^) 

fj, y 

fa = Y.^,af^. (18) 

The goodness of a model and the significance of parameters can be indicated by Akaike Information 
Criterion (AIC). The AIC value is defined as 

AIC 

= —2£{0) + 2 ■ (number of adjustable parameters) (19) 
AAIC 

^ AIC-f2iV^^/f^^°^nog(/f^5°^^) (20) 

K A 

= 2A^/kl(^) + 2 • (number of adjustable parameters) (21) 

For convenience, AAIC, which is equal to a constant value added to the AIC value, is also defined above. 
The AIC and AAIC always take a non-negative value. Models with smaller AIC and AAIC can be 
considered to be more appropriate |37j . 

Parameters in the present model are /3, m^ri-, f™'^^-, fr/, and a. Assuming that the observed process 
of substitutions is in the stationary state, the estimates of the equilibrium codon and the equilibrium 
amino acid compositions, and /a, are taken to be the observed composition of the codon and of the 
amino acid: 

U = /a=/a°'^ (22) 
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In the case of amino acid sequences, for which their coding sequences are not available, codon compositions 
may be parameterized by 



J v~> /-^ /.usage l^'^j 



Z^a ^ t^aJaJ fi 

/•usage j'usage Tusage ^usage 10A\ 

In the present analyses, this parameterization is used for the equilibrium codon compositions in amino 
acid sequences. 

Then, the shape parameter r of a F distribution for variations in mutation rates or evolutionary time 
intervals for observed codon or amino acid substitutions is estimated by equating the ratio of the expected 
number of substitutions in the model to its observed value. 

= E/°''^- (25) 

K K 

Other parameters /?, m^^, /™"*, f^^'^^'^, and a are evaluated as ML estimators or fixed to a proper value. 
The observed transition matrix S°\^ corresponding to 1-PAM is used here; PAM means accepted point 
mutations per 100 amino acids. 

E/«°'^^- = 0-99 (26) 



The total number of site comparisons (N) for each empirical substitution ma- 
trix 

In the case of JTT, 59190 accepted point mutations found in 16130 protein sequences were used to build 
a substitution probability matrix of 1-PAM [5]. Thus, the total number N of amino acid comparisons 
for JTT is assumed to be equal to = 59190/0.01. On the other hand, a phylogenetic tree for cpREV 
is based on 9957 amino acid sites of 45 proteins encoded in chloroplast DNAs of 9 species [8], and the 
one for mtREV is based on 3357 amino acid sites of the complete mitochondrial DNA from 20 vertebrate 
species (3 individuals from human) [6]. Thus, the total number of site comparisons N for them may be 
approximated to be equal to the number of amino acid sites multiplied by the number of branches in the 
phylogenetic tree used to evaluate the transition matrices; that is, A^ « 9957 • (2 • 10 — 3) = 169269 for 
cpREV, and A^ « 3357 • (2 • 22 - 3) = 137637 for mtREV. The BRKALN database consisting of 50867 
sites and 895132 residues was used to estimate WAG. Thus, A^ « 895132 • 2 - 50867-3 = 1637663 is used 
for WAG [iniin]. To evaluate LG, 3412 of 3912 alignments consisting of 49637 sequences, 599692 sites, 
and 6697813 residues are used [Tl]- Therefore, N w (6697813 • 2 - 599692 • 3) ■ 3412/3912 10114373 is 
assumed for LG. These crude estimates of A^ are used to evaluate the AICs of JTT, WAG, LG, cpREV 
and mtREV. 

In the case of KHG, which was estimated by maximizing a likelihood of a set of phylogenetic trees of 
coding sequences of 7332 nuclear protein families taken from Pandit database [55] : the total numbers of 
residues and sites are not written in Kosiol et al. [14] , so that an AIC value is not given for KHG in the 
following. 

Results 

Models, each of which includes a different number of parameters and is a special case of models includ- 
ing more parameters, are fitted by a maximum likelihood method to each of the 1-PAM amino acid 
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substitution frequency matrices, JTT [5], WAG [TU], and LG [TT] for proteins encoded in nuclear DNA, 
cpREV [5] for chloroplast DNA, and mtREV [B] for mitochondrial DNA. Also, the models are fitted to 
the 1-PAM codon substitution frequency matrix of KHG [HI for nuclear DNA. The selective constraints 
Wab are either directly estimated by ML or evaluated from a known estimate y;Ostimate j-^y [TT] that 
includes two parameters (3 and wq. The parameter wq is fixed here to for amino acid substitution 
matrices because the likelihood of an amino acid substitution matrix does not strongly depend on wq; 
codon substitution data are required to reliably estimate the value of wq, which significantly affects the 
ratio of nonsynonymous to synonymous substitution rate. Each model is named to indicate either the 
method to estimate Wab or the name of 7j;C|timatc ^j^j-^ suffix meaning the number of ML parameters. 
Each model is briefly described in Table [T] The Nelder-Mead Simplex algorithm has been used for the 
maximization of likelihoods. 



The effects of selective constraints 

First, the No- Constraints models, in which selective constraints do not depend on amino acid pairs, /? = 
in Eq. were examined to see how well nucleotide mutation rates, codon frequencies and a genetic code 
can explain the observed frequencies of amino acid substitutions in JTT, WAG, cpREV, and mtREV; 
the No-Constraints models disallowing multiple nucleotide changes are equivalent to mononucleotide 
substitution models, because wo = is used here. The AAIC value and the ML estimates for each 
parameter set are listed in Table [2] and Table IS1[ respectively. Please refer to Supporting Information, 
Text SI, for details. These No-Constraints models serve as a reference to measure how selection models 
can improve the likelihoods. Then, we examine various estimations of selective constraints on amino 
acids based on the physico-chemical distances of amino acids evaluated by Grantham [3T] and by Miyata 
et al. |32| and mean energy increments due to an amino acid substitution. These models are called 
Grantham, Miyata, and Energy-increment-based (EI) models, respectively. Please refer to Supporting 
Information, Text SI, for the definition of the mean energy increment and for the details of each model. 
The AAIC values and the ML estimates for these models with various sets of parameters are also listed in 
Table [21 and Tables [S2] and IS3[ respectively. Comparisons of AAIC values between the models in Table 
[2] indicate that the selective constraints on amino acids representing conservative selection against amino 
acid substitutions significantly improve the AAIC values of all substitution matrices. It is also indicated 
that the Miyata's physico-chemical distance performs better in all parameter sets than the Grantham's 
distance. This result is consistent with that of Yang et al. [7| for mitochondrial proteins. The present 
physico-chemical evaluation of selective constraints (EI models) fits JTT and WAG even better than the 
Miyata's distance scale, although the performances of both the methods are almost same for cpREV and 
mtREV. One of the important facts in these results is that allowing multiple nucleotide changes in a 
codon significantly improve the AIC irrespective of the estimations of selective constraints; compare the 
AAIC values between the Grantham-10 and the Grantham-11, between the Miyata-10 and the Miyata-11, 
and between the EI-10 and the EI-11. 



The effects of multiple nucleotide changes on ML estimations 

In principle, all parameters {wab} for selective constraints can be optimized in the case of codon sequences. 
In the case of protein sequences, all 190 non-diagonal elements of w in addition to the parameters for 
mutational tendencies at the nucleotide level and others cannot simultaneously be optimized; the number 
of freedoms in a general reversible model for an amino acid transition matrix is equal to 209. 
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In order to see how well amino acid substitution matriees can be explained with the assumption of 
successive single nucleotide substitutions, let us optimize Wat corresponding to single-step amino acid pairs 
by assuming that only single nucleotide mutations are possible, i.e., by m^tcllag] ^ with m^ri / 'm'[tc][ag] = 
constant in Eq. [51 The number of Wab for the single-step amino acid pairs is equal to 75 in the case 
of the universal genetic code. All 75 Wab for the single-step amino acid pairs have been optimized for 
each of JTT and WAG together with the nucleotide exchangeabilities {m^^}, the equilibrium nucleotide 
composition {/"™*}, the codon usage parameters {/^''^^''} and the scale parameter <t; the total number 
of the parameters is equal to 87 in addition to the 19 amino acid frequencies and the shape parameter r. 
This maximum likelihood model to estimate the matrix w is called ML with a suffix meaning the number 
of ML parameters; see Table [T] The ML estimates of these parameters except liab for the ML-87 are 
listed in Table [3] for JTT and WAG. 

In the lowest rows of this table, the ratio of the total nucleotide substitution rate per codon to the codon 
substitution rate, which represents the average number of nucleotide changes for substituting a codon, 
the ratio of the total transition to the total transversion rate per codon, and the ratio of nonsynonymous 
to synonymous substitution rate per codon are listed for the models. The sum of the total transition and 
the total transversion rates per codon is equal to the total nucleotide substitution rate per codon. The 
lowest three rows list their values in the case of cr — ^ and Wab = 0, and the second lowest three rows for 
the case of a — >■ 0. Thus, the differences of their values between the lowest and second lowest three rows 
represent the effects of selective constraints on amino acids (wab), and those between the second lowest 
and the third lowest three rows describe the effects of rate/time variations on the substitution matrix. If 
codon substitutions proceed by successive single nucleotide changes, i.e., 'rn[tc][ag] ^ 0, then the ratio of 
the total nucleotide to the codon substitution rate will be equal to 1 in the case of tr — > 0. 

Here it should be noticed that the nonsynonymous and the synonymous substitution rates are defined 
not to be rate per site but simply rate per codon. The sum of the nonsynonymous and the synonymous 
substitution rates is equal to the codon substitution rate. The ratio of the nonsynonymous to the syn- 
onymous substitution rate per codon does not corresponds to the ratio of nonsynonymous to synonymous 
substitutions per site, Ka/Ks [39], but the ratio of nonsynonymous to synonymous substitutions per 
codon, Ma/Ms [39]. The ratio (Na/Ns [39]) of the effective number of nonsynonymous sites to that of 
synonymous sites per codon corresponds to the ratio of nonsynonymous to synonymous rate in the case 
of no selective constraints {wab = 0). In the present models, Ka/Ks indicating the effects of selection 
on amino acid replacements corresponds to the nonsynonymous to synonymous substitution rate ratio in 
the case of cr divided by that in the case of Wab = and tr — > 0. Table [3] indicates that selection on 
amino acids is conservative, because the ratio of nonsynonymous to synonymous rate per codon is much 
smaller in the case of cr — )■ than in the case of Wab = and tr — >■ 0. 

As expected, the AIC value drastically decreases from that of the EI-14 in both cases of JTT and 
WAG, indicating that the introduction of many parameters may be still appropriate. However, there are 
large discrepancies between the observed transition matrix and the one estimated by the ML-87. Let us 
see the discrepancies between them in terms of log-odds. 

A log-odds matrix introduced by Dayhoff et al. [4] is one of the representations of amino acid 
substitution propensities. The (k. A) element of the log-odds matrix is defined to be the logarithm of 
odds to find an amino acid pair (k. A) in comparison with random sequences. The odds O^x is equal to 
the (k. A) element of transition matrix divided by the amino acid composition fx- 

0{S{t))^x = S{t)^x/fx (27) 

\og-OiSit))^x ^ -^logO{Sit)),x (28) 
log 10 

The proportional constant in Eq. [21] is the one originally used by Dayhoff et al. [3] . 

In Fig. [TJ the log-odds log-0{{S){t))ab corresponding to the 1 PAM transition matrix of the ML-87 
model fitted to JTT are plotted against those calculated from JTT. Plus, circle and cross marks show 
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the log-odds for one-, two-, and three-step amino acid pairs, respectively. Although the estimated values 
of log-odds for one-step amino acid pairs are almost exactly equal to those of the JTT matrix, there are 
still large discrepancies between the log-odds values for two- and three-step amino acid pairs, indicating 
a non-stepwise manner of codon substitutions. Similar discrepancies are also found in Fig. [ST]for WAG. 

We have examined how the AIC is improved by enabling multiple nucleotide changes in a codon. The 
selective constraints {wab} for multiple nucleotide changes are classified into 6 groups according to the 
amounts of discrepancies between the observed and the estimated values of the log-odds as shown in Fig. 
[TJ Then, the ML estimates of 94 parameters including 7 additional parameters, Wab for the 6 groups 
of multiple nucleotide changes and the parameter ni[tc][ag] for the rate of multiple nucleotide change, 
are calculated. This model is called ML-94. Also, the values of {wab} for multi-step amino acid pairs 
are calculated by maximizing the likelihood with fixing the values of all other parameters including Wab 
for the single-step amino acid pairs; this model is called here ML-94-1- by appending the mark. It 
should be noted that these values of Wab for the multi-step amino acid pairs in the ML-94-1- are not ML 
estimates at all. The ML estimates Wab for single-step amino acid pairs, the classification of multi-step 
amino acid pairs into the 6 groups, and the ML estimates for those categories of Wab are provided in 
Supporting Information, Data SI. As shown in Table [3l the ML estimates of m^ri, and f^^'^^'^ for 

the ML-87 model are very different from those for the ML-94, and some of them for the ML-87 seem 
to be unrealistic. For example, 'rhta/'nT'itc][ag] is evaluated to be smaller than 0.1. Also, the small value 
of ft+^° indicates the extremely biased usage of codons. The ML estimate (t of a F distribution is too 
large. These parameters are forced in the ML-87 to take such values to reduce the discrepancies between 
the observed and the estimated counts for multi-step amino acid pairs. In the ML-94 model, the ML 
estimators of these parameters take more reasonable values. However, it may also yield unreasonable 
estimates for codon usage parameters, {/^'^''^''}; for example, f^^^° = 0.221 in the ML-94 for WAG, and 
jusage ^ Q . /^"^age ^ q -^^ ^j^^ ML-94 for LG. Thus, the ML-91 model with /^""""sc ^ q ^^ich 
means equal codon usage, may be better than the ML-94. The ML-91 model was applied for JTT, WAG, 
and LG, and the ML estimates for them in the ML-91 are also listed in Table [31 

The ML estimators m^r;, /™"*, and a show a similar tendency between the ML-91 models for all 
the amino acid substitution matrices, i.e., JTT, WAG, and LG. The parameter rn[tc][ag] for multiple 
nucleotide changes and the scale parameter a for rate variation are both significant for all the matrices. 
The values of Thtc\ag/^[tc][ag] > 1 for JTT, WAG, and LG indicate that the mean exchangeability of the 
transition type is larger than that of the transversion type in all the matrices. 

As shown in Fig. [1] for JTT and in Fig. [ST] for WAG, the large discrepancies of the log-odds for the 
multi-step amino acid pairs disappear in the ML-91, in which multiple nucleotide changes are taken into 
account. The AIC values of JTT and WAG are significantly improved by enabling multiple nucleotide 
changes in the ML-91. This fact confirms that multiple nucleotide changes are statistically significant 
and should be taken into account to build a codon substitution model. 



ML estimation for the KHG codon substitution matrix 

If a codon substitution matrix is used for model fitting with the assumption of multiple nucleotide changes, 
all 190 parameters of selective constraints {wab} will be able to be optimized. The ML-200 model has 
been fitted to the 1-PAM codon substitution frequency matrix of KHG, which was empirically estimated 
without any restriction on multiple nucleotide changes |14| . 

The log-odds values for the codon pairs requiring single, double, and triple nucleotide changes are 
shown in Figs. [2j3, and [2lC, respectively. In these figures, upper triangle, plus, circle, and cross 
marks show the log-odds values for synonymous pairs and one-, two-, and three-step amino acid pairs, 
respectively. The dotted line shows the line of values where the observed and the estimated values 
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of log-odds are equal to each other. The log-odds of the codon pairs requiring single/double/triple 
nucleotide changes for one/two/three-step amino acid pairs respectively tend to fall along the dotted line 
in comparison with the log-odds of the other codon pairs. In other words, the log-odds of the codon pairs 
for which any nucleotide change is accompanied by an amino acid change arc correctly estimated. On 
the other hand, the estimated log-odds values do not well agree with the observed ones for synonymous 
codon pairs shown by the upper triangles. These estimated log-odds can be adjusted only by changing 
nucleotide mutation rates, i.e., m^j, and /™"*. Thus, the approximations of the independence and of no 
difference of nucleotide exchangeabilities between nucleotide positions may be limited; see Eq. [1] 

The codon pairs, whose log-odds values are less than —30 and which require more nucleotide changes 
than the least nucleotide changes required for the corresponding amino acid pair, tend to be located in 
the upper region than in the lower region of the dotted line; see plus marks in Fig. [33 and plus and circle 
marks in Fig. [^JC. Such a tendency is more clear in Fig. in which plus and circle marks corresponding 
to one- and two-step amino acid pairs are mostly located far from and almost in parallel to the dotted 
line. The estimated values of the log-odds for these one- and two-step amino acid pairs are greater by 10 
-15 than the observed values. 

In Fig. [2jl), the log-exchangeabilities of the codon pairs requiring triple nucleotide changes in the 1- 
PAM KHG matrix are plotted against their log-odds of the 1-PAM KHG matrix. The log-exchangeability 
is defined here to be (10/ log 10) log[i?J^^'^ • ti-pAM//i/]- The log-exchangeabilities of the codon pairs 
corresponding to three-step amino acid pairs are all nearly equal to their log-odds. The smallest log- 
exchangeabilities of these codon pairs reach almost —40. However, there are many codon pairs whose log- 
exchangeabilities are smaller than —40, and all of them correspond to one- or two-step amino acid pairs. 
The log-exchangeabilities of these codon pairs are significantly smaller than their log-odds, indicating that 
almost all substitutions of these codon pairs were estimated in KHG not to occur by triple nucleotide 
changes but rather by successive single or double nucleotide changes. 

In the present model, codon exchangeabilities are approximated by the product of nucleotide ex- 
changeabilities; see Eq. [1] for the exact expression. Therefore, all codon exchangeabilities for triple 
nucleotide changes are in the same order of magnitude, and specific codon pairs cannot be significantly 
less exchangeable. Thus, the present approximation for codon exchangeabilities may have a limitation, 
unless those exchangeabilities of KHG are underestimated. Estimation of the exchangeabilities for those 
codon pairs, which require more nucleotide changes than the least nucleotide changes required for the 
corresponding amino acid pair, may be less reliable than for the others. 

The ML estimates to , /™"' and a for KHG are listed in Table [S] The scale parameter u of the 
r distribution is estimated to be 0.0 for KHG, meaning that variations in rates need not be taken 
into account for KHG. There is a different tendency in the {to^,;} between KHG and the amino acid 
substitution matrices. One remarkable difference between them is that the parameter 'mtc\agl"'n{tc\\ag\ 
for transition-transversion bias is estimated to be greater than one in the ML-91 for JTT, WAG, and 
LG but to be less than one in the ML-20Q for KHG. This estimation of transition to transversion bias 
for KHG results from a fact that the ratio of the total transition to the total transversion substitution 
rate is actually equal to 0.765 in KHG, although this fact is contrary to the common understanding of 
transition-transversion bias. Because selective constraints on amino acids more favor transitions than 
trans versions, transition-transversion bias in nucleotide mutation rates for KHG must be much less than 
0.765. Actually the ratio of the total transition to the total transversion mutation rate is estimated to 
be 0.427; see Tabled 
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Comparison of ML estimates Wab among the present models 

In Table HI the correlation coefficients of Wab between the present models are listed. The lower half 
of the table lists those for single-step amino acid pairs, and the upper half lists those for multi-step 
amino acid pairs by excluding the amino acid pairs that belong to the least exchangeable class at least in 
one of the models. Each model name of JTT/WAG/LG-ML91-f and KHG-ML200 means the empirical 
substitution matrix and the method used to estimate selective constraints, Wat- In the following, these 
ML estimates of Wab wih be specified as ^£,Jtt/wag/lg-ml9i+ ^^^^ ^khg-ml2oo^ ^j^g method, 
selective constraints are approximated by a linear function of the energy increment due to an amino acid 
substitution, Aejjj, + Ae^j,, which is defined by Eqs. Sl-4, Sl-5, and Sl-6 in Supporting Information, Text 
SI; therefore, w^l ^ -(Ae-, + Ail,). 

The correlations of the ML estimates {wab} between the JTT-ML91-H, the WAG-ML91-H, and the 
LG-ML91-I- are very strong even for the multi-step amino acid pairs. Comparisons of the ML estimates 
of selective constraints between various models are shown in Fig. IS2I The estimated from 

the KHG codon substitution matrix are less correlated with |^-|^t/wag/lg-ml91+| ^^^^ ^^le other amino 
acid substitution matrices, especially less for the multi-step amino acid pairs. The ML estimates {—Wab} 
for the multi-step amino acid pairs are relatively smaller in the KHG-ML200 than in the JTT/WAG/LG- 
ML91+ models; see Fig. EH 

The correlations of {wab} between the EI and others are not as good as those between the other 
estimates, but they are significant especially between the EI and the KHG-ML200 even for the multi-step 
amino acid pairs. In Fig. [3K, the ML estimates {-wlJ'^~^^^^~^} in the JTT-ML91-h are plotted against 
the energy increments {—w^b} to an amino acid substitution; the least exchangeable category of 
multi-step amino acid pairs are not shown in this figure. Similar plots for the WAG-ML91-I- and for 
the LG-ML91-h are shown in Fig. [SH The ML estimates {_iyKHG-ML200| f^j. g^j^ amino acid pairs in 
the KIIG-ML200 are plotted against the energy increments {—w^i,} ™ ^'^S- 133 • No drastic difference in 
the correlation between these two quantities is found among one-, two-, and three-step amino acid pairs. 
The correlations of {wab} between the EI and the other models are better for the ML-91 than for the 
ML-87; the correlation coefficient between them for the single step amino acid pairs is equal to 0.19 for 
the JTT-ML87 but 0.66 for the JTT-ML91 and 0.30 for the WAG-ML87 but 0.68 for the WAG-ML91. 
The ML estimates {~Wab} for the single step amino acid pairs are compared between the ML-87 and the 
ML-91 models in Fig. [Ml 

In the next section, we will examine whether the differences among these estimates of Wab are signifi- 
cant in representing selective constraints on amino acids. 



Performance of the ML estimates {wab} and the characteristics of nucleotide 
mutations estimated 

The present model for codon substitutions is designed to separate selective pressures at the amino acid 
level from mutational events at the nucleotide level. Both unequal usage of degenerate codons and 
different rates of transition and transversion are characteristic of a genetic system specific to each species 
and each organelle. On the other hand, the relative strengths of selective constraints on amino acids 
would be far less specific to each species and each protein than each type of amino acid, although the 
mean strength of the selective constraints is specific to each protein family. Thus, we tried to approximate 
selective constraints {wab) for empirical substitution matrices including cpREV and mtREV by a linear 
function of those (wab) estimated from each of JTT, WAG, LG, and KHG; ^JTT/wag/lg-ml91+ 
w™G-ML2oo ^gg^^ ^estimate ^ ^g^u ^j^^gg models JTT/WAG/LG-ML91-(- or KHG- 
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ML200, which mean the empirical substitution matrix and the model used to estimate uj^stimate^ with 
a suffix meaning the number of ML parameters; see Tabled] 

In Table [5l the ML values for these models with the various sets of parameters are listed for all 
empirical substitution matrices. The ML estimates in the JTT/WAG/LG-ML91+-11 and the KHG- 
ML200-11 models are hsted in Tables H E and [S] The JTT-ML91+-0, the WAG-ML91+-0 and the 
LG-ML91+-0 models are the codon-based models corresponding to the JTT-F, the WAG-F and the LG- 
F amino-acid-based model, respectively, in which the JTT, the WAG and the LG rate matrices with an 
adjustment for the equilibrium frequencies of amino acids are used as a substitution rate matrix, because 
all 11 parameters of m^^, /™"*, and a are fixed to the values of their ML estimators in the ML-91+ for 
JTT, WAG, and LG; /? = 1 and iuq = are assumed, However, a critical difference is that a genetic 
code cannot be taken into account in the JTT/WAG/LG-F but in the JTT/WAG/LG-ML94+-0. This 
difference between both models can been clearly seen in the present models applied to mtREV, because 
a non-universal genetic code is used in the vertebrate mitochondrial DNA. The AAIG is improved from 
435.6 in the JTT-F to 426.0 in the JTT-ML91-I— 0. This indicates an advantage of the present mechanistic 
model to the empirical amino acid substitution model. 

The AIG values of the JTT/WAG/LG-ML91+-0 are better for all the four matrices (JTT, WAG, 
cpREV, and mtREV) than those of the physico-chemical method ELll; compare Tables [2] and [5j The 
AIG values of the KHG-200-0 are better for aU except for JTT than those of the EI-ll. The AIG values 
of all the models are drastically improved for all the matrices by optimizing the 11 parameters; see Table 
El It is noteworthy that all the models of the JTT-ML91-h-ll, the LG-ML91-h-ll, and the KHG-ML200- 
11 yield a better AIG value for WAG than the ML-87 model does, rejecting the null hypothesis of no 
multiple nucleotide change again; see Tables [3] and [5] Thus, the ML estimates y)JTT/WAG/LG-ML9i+ 
^KHG-ML200 sufficiently represent selective constraints on amino acid substitutions. 

In addition. Table [5] indicates which parameters are the most effective for improving AIG. As well as 
the EI models, the JTT/WAG/LG-ML9H— 7, in which the parameters m^^ are fixed to the ML estimates 
for JTT/WAG/LG with a certain ratio of transition to transversion exchangeability, can improve the AIG 
up to the similar degree to the AIG values of the JTT/WAG/LG-ML91+-11, respectively. In other words, 
the parameters {/|""*} are very effective to improve the AIG in comparison with the parameters {m^^}. 

The log-odds values of amino acid pairs estimated by the KIIG-ML200-11 are plotted against their 
empirical values for the 1-PAM amino acid substitution matrices of JTT, WAG, LG, and mtREV in Fig. 
m Similar plots are shown in Figs. [S5] - ISIOI The comparisons of Fig. [1] and Fig. EI] for the ML-87 
model with Fig. |3]and Fig. [S5] clearly indicate the good qualities of the ML estimators y)KHG-ML200 
^JTT/WAG/LG ML91+ j^Q^g^j^jy^jy Jaj-gQ disagreements between empirical and estimated log-odds exist for 
cpREV and mtREV in comparison with those for JTT, WAG, LG, and the KHG-derived amino acid 
substitution matrix (KHGaa); see Fig. |4]and Figs. [S5]-[S7l It is unknown whether the disagreements 
shown in these figures represent meaningful features in the amino acid substitutions in the chloroplast 
DNA and the mitochondrial DNA or result from the relatively small size of sequence data used for 
cpREV and mtREV. However, the large disagreements in the region of low log-odds values may be 
artifacts, because cpREV and mtREV tend to include relatively large errors in this region, especially for 
mtREV; the log-odds values for mtREV whose values are smaller than about —47.8 are all assumed to 
be —47.8; see the original paper [6]. 

The ML estimates of 1//3 listed in Tables |6l [71 and |8l indicate that the strength of selective constraints 
on amino acids is strong in the order of LG, WAG, and JTT. The strength of selective constraints is also 
shown by the change of the ratio of nonsynonymous to synonymous rate per codon between the two cases 
without and with selective constraints, i.e., the cases oiwah = and cr — >■ 0, and ct 0. As already noted, 
the ratio of these values between the two cases represents the strength of selective constraints. In the 
KHG-ML200-11, these ratios are equal to 0.293/5.23 = 0.056, 0.577/5.35 = 0.11, and 0.499/3.71 = 0.13 
for LG, WAG, and JTT, respectively, meaning that the selective constraints of LG are strongest; it should 
be noted that this order agrees with the increasing order of 1//3. 
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Tables[n]and[7]indicatc that the selective constraints t£,KHG-ML200 estimated from the KHG codon sub- 
stitution matrix tend to estimate the contribution of muhiple nucleotide changes to be smaller, 
the ratio of transition to transversion exchangeability ('Tijc|ag/'7i[tc][ag]) to be smaller, 'Tita/''T^[tc][ag] to be 
larger, and variations in substitution rates (a) to be less than the u)JTT/wag/lg-ml91+ fj-Qj^ amino 
acid substitution matrices. Table [8] shows that the same characteristic differences will be observed if 
the JTT/WAG/LG-ML91+-11 models are fitted to the codon substitution matrix of KHG instead of 
its derived amino acid substitution matrix. Tables [51 [3 and [5] also show that the ratio of transition 
to transversion exchangeability (m-tciag/'^ltcllag]) tends to be estimated to be smaller in the order of the 
LG-ML91+, the WAG-ML91-I-, the JTT-ML91-I-, and the KHG-ML200. The m(c|ag/?7i[te][ag] is estimated 
by the ML-91 or the ML-200 model to be smaller in the order of LG, WAG, JTT, and KHG; see Table 
131 The present ML estimates {wab} for selective constraints on amino acids seem to reflect the charac- 
teristics of respective substitution matrices to which the models are fitted. It remains to be analyzed 
which estimation is better among the JTT/WAG/LG-ML91+ and the KHG-ML200 and how better it is. 
Irrespective of which estimation of the selection constraints is better, the ML estimates mtc\ag / 'nT'[tc][ag] 
indicate that the transition to transversion bias is not so strong as previously estimated. 

One of the interesting facts is that the ratio of the total transition to the total transversion rate per 
codon will be estimated to be much larger if multiple nucleotide changes are neglected; '7T,t(,|ac//"^[tc][ag] 
(and the ratio of the total transition to the total transversion rate for cr — >■ 0) are estimated for the 
mtREV to be 2.15 (3.32) in the JTT-ML91-K-10 but 2.01 (2.52) in the JTT-ML9H-11, 4.27 (4.13) in the 
WAG-ML91+-10 but 3.43 (2.73) in the WAG-ML91+-11, 4.57 (4.74) in the LG-ML91+-10 but 3.82 (3.31) 
in the LG-ML91-h-ll, and 1.81 (2.58) in the KHG-ML200-10 but 1.64 (1.96) in the KHG-ML200-11. The 
same tendency is observed for JTT, WAG, cpREV, and mtREV irrespective of the matrices, and for the 
EI, the Miyata, and the Grantham models irrespective of the models. 

In the case of mtREV, not only the transition-transversion exchangeability bias {rhtciag/TT^ltcllag]) but 
also the ratio of the total transition to the total transversion rate per codon is larger in the JTT/WAG/LG- 
ML914-11 than in the JTT/WAG/LG-ML91+-0, and in the KHG-ML200-11 than in the KHG-ML200-0. 
Also, the JTT/WAG/LG-ML91+-11 and the KHG-ML200-11 models estimate mtc\ag/m[tc][ag] and the 
ratio of the total transition to the total transversion rate to be larger for mtREV than for JTT, WAG, 
and cpREV. These results are consistent with a well-known fact that transition to transversion bias is 
larger in mitochondrial DNA than in nuclear DNA. 



Discussion 

Halpern and Bruno [40] considered a codon-substitution model in which site-specific selection is taken 
into account in terms of residue frequencies. If site-specific codon frequencies are explicitly taken into 
account in the present model, the substitution rate Rf^i, will be regarded as the average of the site- 
specific rate over sites i. According to Eq. [3 the site-specific rate is defined as the product of 

site-independent mutation rate Mf^^ and site-dependent fixation probability, (/^//™"*)e"'*"'. 



i?;, = Const AV-^^e"'- ioT^i^,y (29) 



Here the site-dependency of the fixation probability is taken into account only in terms of codon frequen- 
cies. Then, the average of the site-specific rate over sites is calculated as follows. 

R^. = Cor^st li\r - Const AVl^rrFg""^ ioitl^V (30) 



(31) 



16 



where is the average of /* over sites. Thus, the w^i^ defined here ineludes the effects of site-specific 
selection in terms of codon frequencies. 

In the model of Halpern and Bruno |40| . the term of e^f" was not distinguished from and merged 
with the mutation rate M^^; that is, e"'^"' = constant for ^ v was assumed, Yang and Nielsen [22) 
considered mutation-selection models of codon substitutions and estimated selective strengths on codon 
usage. In their models, selection pressures that deviate codon frequencies from the equilibrium codon 
frequencies at the mutational level were explicitly taken into account, and selective constraints on amino 
acids are assumed to be constant over amino acid pairs; that is, e™"'' = constant for a ^ h was assumed. 
However, the site-specific selection was not considered; that is, = f^. In other words, unlike the 
present model, selection was taken into account principally in terms of codon or residue frequencies in 
both the models. Also, multiple nucleotide changes were not taken into account. Halpern and Bruno 
[10] developed their model for distance calculation. As pointed out by Yang and Nielsen [52], taking 
account of site-specific codon frequencies is not practical for real data analysis due to the use of too many 
parameters. Instead, the use of w^^ is more practical. The present results show that the ML values of 
the JTT/WAG/cpREV/mtREV amino acid substitution matrices are too small in the No-Constraints 
models in which Wab = is assumed, and they can be improved by taking account of the term of the 
selective constraints e""''". Also, it is indicated that selective constraints on amino acids strongly depend 
on the type of amino acid. 

In some previous models [71|T71[TB], amino acid substitutions were assumed to proceed in a stepwise 
manner by successive single nucleotide changes in a codon. The empirical amino acid substitution matrices 
of JTT, WAG, LG, cpREV, and mtREV, and the codon substitution matrix KHG all include many 
substitutions between amino acid or codon pairs requiring multiple nucleotide changes. Significance of 
multiple nucleotide substitutions was pointed out [TJHHHnillZlllS] • There are two possible mechanisms 
to yield substitutions between such multi-step amino acid pairs even for a short time interval. One is 
variations in substitution rates or time intervals. Another is multiple nucleotide changes in a codon. 
Here, the assumption of multiple nucleotide changes has been directly introduced into a codon-based 
substitution model together with the use of a F distribution for variations in substitution rates and time 
intervals, and the effectiveness of the assumption has been examined. 

In the models using any physico-chemical evaluation of selective constraints, the significance of mul- 
tiple nucleotide changes has been indicated; see Tables [5] and [31 The ML-87 models fitted to JTT and 
WAG, in which the selective constraints {wab} for all single-step amino acid pairs are optimized by max- 
imizing the likelihood with the assumptions of no multiple nucleotide change for codon substitutions and 
of variations in substitution rates, reveal that large discrepancies between the observed and the estimated 
log-odds values remain for multi-step amino acid pairs; see Fig. [T] When multiple nucleotide changes are 
taken into account in the model ML-91, these discrepancies disappear and the AIC values significantly 
decrease, indicating the significance of multiple nucleotide changes in codon substitutions; see Fig. (T] 
Fig. Eil and Tabled 

Evidence for multiple nucleotide changes was found by Averof et al. [27], and the frequency of 
multiple nucleotide changes was evaluated [20]. On the other hand, a possibility for successive single 
compensatory substitutions was pointed out by Bazykin et al. [29]. As pointed out by Kosiol et al. 
[Tl] . the high exchangeabilities of the double nucleotide changes, Rcgt <-> Ragg and Rcgt <-> Raga, in 
KHG may result from successive single compensatory substitutions. On the other hand, a selection on 
synonymous substitutions is necessary for compensatory substitutions to cause the higher exchangeability 
of Rcga -H- Ragg than estimated, because the most probable paths of single nucleotide changes between 
Rcga and Ragg are Rcga O Raga O Ragg and Rcga O Rcgg -H- Ragg both of which do not accompany 
any amino acid change; see Fig. [5] Whatever causes multiple nucleotide changes, the present scheme for 
codon substitutions could be applied to phylogenetic analyses of protein-coding sequences, because the 
underlying time scale in the present substitution model is much longer than that of positive selection for 
successive single compensatory substitutions. 
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The models JTT/WAG/LG-ML91+-0 and KHG-ML200-0, in which parameters are taken to be equal 
to the ML estimates for JTT/WAG/LG in the ML-91+ model and the ML estimates for KHG in the ML- 
200 model, are codon-based models corresponding to the JTT/WAG/LG/KHG-F model, respectively. 
The model ML-91+ can almost perfectly reproduce JTT, WAG, and LG. The model ML-200 for the KHG 
codon substitution matrix can well reproduce the codon substitution probabilities for the codon pairs for 
which any nucleotide change is accompanied by an amino acid change, although the exchangeabilities of 
the other codon pairs are over-estimated for KHG. This means that the JTT/WAG/LG-ML91+-0 and the 
KHG-ML200-0 models can be used as a simple substitution model without any loss of information instead 
of the empirical substitution matrices of the JTT/WAG/LG/KHG in maximum likelihood and Bayesian 
inferences of phylogenetic trees of amino acid and codon sequences, respectively. Although the empirical 
substitution matrices represent the average tendencies of substitutions over proteins and species and 
may lack gene- level resolution jl51ll6| . the present mechanistic codon model has adjustable parameters 
for nucleotide mutation and for the strength of selective constraints, which can be tailored to specific 
genes. It is possible to optimize the selective constraints {wab} for each gene. However, such a method 
[T2lll5ll6| is far more computer- intensive than the present method. The present methods, JTT/WAG/LG- 
ML91-t-n using u,JTT/wag/lg-ml91-h ^^^^ KHG-ML200-n with the it,KHG-ML200^ provide alternative 
models for amino acid/codon substitutions with a small number of ML parameters in the probabilistic 
inference of phylogenetic trees. The number of ML parameters specific to the present model is at most 6 
exchangeabilities and 3 equilibrium frequencies for nucleotide mutations, and 2 parameters for selective 
constraints. Thus, the present model requires the same order of cpu time as the nucleotide substitution 
model (GTR) does. In other codon models pTl[23] . exchangeabilities between amino acids are taken to be 
equal to their values in empirical amino acid substitution matrices. However, in the present codon model, 
amino acid and codon exchangeabilities vary according to nucleotide mutation rates and the strength of 
selective constraints. 

The parameters m^^, and a are differently estimated by the KHG-ML200-n and the JTT/WAG/LG- 
ML91-I— n using different w; see Tables [H [71 and H The u,khg-ml200 yj^j^jg ^ smaller rate of multiple 
nucleotide changes, a smaller a, a smaller ratio of transition to transversion exchangeability, and a smaller 
ratio of nonsynonymous to synonymous rate per codon than the tyJTT/WAG/LG-ML9i-i- (jQgg_ Whichever es- 
timation is better, the present ML estimators TOtc|ag/™[tc] [ag] ioi transition-transversion bias strongly indi- 
cate that the transition-transversion bias is not so large as previously estimated. An excess of transitional 
over transversional substitutions was shown in the DNA sequences of metazoa, and has been assumed to 
be universal. However, Keller et al. j41j found a counter example to the transition-transversion bias from 
grasshopper pseudogenes. The present ML estimate of the ratio of transition to transversion exchange- 
ability for the KHG codon substitution matrix is rather less than 1.0, i.e., mtc\ag / '>TT-[tc][ag] = 0.843 in the 
ML-200 model, which corresponds to the overall rate bias of transitions over transversions, 0.427. Even 
for the amino acid substitution matrices JTT, WAG, and LG, the ML-91 model estimates rn-tc\ag/i^[tc][ag] 
to be less than 1.9, making the overall rate bias of transitions over transversions less than 1.0; see Table[3l 
It should be noted that the ratio of transition to transversion exchangeability tends to be overestimated 
if no multiple nucleotide change is allowed; see Tables [S2] and [S3l Thus, the present results indicate that 
transition-transversion bias is not a solid assumption. On the other hand, the present results indicate 
that transition-transversion bias is stronger in mitochondrial DNA than in nuclear DNA in accordance 
with previous understanding; see Tables [6] and [T] 

The ML estimates |^J^t/wag/lg-ml91+| |^khg-ml200| significantly correlate with each other 
and also with the mean energy increments due to an amino acid replacement. However, the JTT/WAG/LG- 
ML91+-n and KHG-ML200-ri models fit substitution data significantly better than the El-n model; see 
Tables [2] and [5] This fact indicates that the differences between the physico-chemical estimates and the 
ML estimates {wab} for selective pressure at the amino acid level reflect the actual tendency of selective 
constraints for respective types of amino acid pairs in protein evolution. Eq. [31] indicates that the w 
is modulated by site-specific codon frequencies and differentiated from the site-independent constraints. 
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w*, which may be more similar to the physico-chemical estimates than the w. The selective constraints 
estimated here may be used as a base line to detect evidence of positive selection. Models [^OlIU] in 
which the dependences of selective constraints on amino acid pairs are not taken into account may be 
improved by introducing them. On the other hand, it still remains to be examined whether or not 
the JTT/WAG/LG-ML91+-n and the KHG-ML200-n perform comparably with cpREV for the max- 
imum likelihood inferences of phylogcnetic trees of chloroplast proteins and with mtREV for those of 
mitochondrial proteins. Also, it should be examined which performs better. 

A preliminary calculation has been pursued to examine the performance of the present substitution 
models in the ML inference of a pliylogenetic tree. Log-likelihoods of the present models and the codon 
models corresponding to the mtREV-F, the JTT-F, the WAG-F, and the LG-F are calculated and listed 
in Table IH] for a phylogcnetic tree [B] of the concatenated sequences of 12 protein-coding sequences 
encoded on the same strand of mitochondrial DNA from 20 vertebrate species with 2 races from human. 
The phylogcnetic tree and the proteins used are those which Adachi and Hasegawa [B] used to estimate 
mtREV; the Japanese mtDNA was not used because it couldn't be found in the GenBank database. 
The coding sequences of each protein were aligned with codon score matrices by the ClustalW2 [42], 
and then concatenated. Their likelihoods on the phylogenetic tree were calculated by the Phyml [43] . 
Both the programs have been modified for the analysis of coding sequences. Log-odds calculated by the 
KHG-ML200-11 fitted to mtREV were used as the codon score matrices. Positions with gaps are included 
for the calculation of the likelihoods. The codon substitution matrices corresponding to mtREV, JTT, 
WAG, LG, and the KHG-derived amino acid substitution matrix (KHGaa) are calculated in such a way 
that codon exchangeabilities for nonsynonymous codon pairs are taken to be equal to exp wq multiplied 
by the exchangeability of the corresponding amino acid pair and those for synonymous codon pairs are 
assumed to be all equal to the mean amino acid exchangeability. In all models, the parameter wq in Eq. 
[TT]was optimized even for the No- Constraints models, and codon frequencies were taken to be equal to 
those in coding sequences. The substitution matrices, JTT, WAG, LG, and KHG were estimated from 
nuclear DNA, which use a different genetic code from vertebrate mtDNA. On the other hand, mtREV 
was estimated by a maximum likelihood method from the almost same set of the protein sequences 
encoded in mtDNA. Thus, it is expected that the log-likclihood values of the mtDNA phylogcnetic tree 
for the models, KHGaa-l-F, LG-l-F, WAG-IF, and JTT-l-F are worse than that for the mtREV-l-F. 
An important thing is that the codon models with the selective constraints estimated from nuclear DNA 
or by the physico-chemical method yield a much smaller value of AIC than the mtREV-l-F. One of the 
effective parameters is wq that directly controls the ratio of nonsynonymous to synonymous substitution 
rate. It also improves the likelihood to explicitly take account of rate variations over sites. The discrete 
approximation [44] of the F distribution with 4 categories was used to represent rate variations over sites 
in the models named with the sufRx "dG4"; the shape parameter a is a ML parameter. An interesting 
and reasonable fact is that averaging substitution matrices over rate becomes unnecessary, i.e., a — 0.0, 
in the case that rate variations over sites are explicitly taken into account; in the Yang's model [26ll44] . 
the likelihood of a phylogenetic tree of each site is averaged over rate. Also, all the present codon- 
bascd models estimate TO[tc][ag] > 0.1, which indicates the significance of multiple nucleotide changes. 
The present results strongly indicate that the tendencies of nucleotide mutations and codon usage are 
characteristic of a genetic system specific to each species and oranelle, but the amino acid dependences of 
selective constraints are more specifc to each type of amino acid than each species, organelle, and protein 
family. Full evaluation will be provided in a succeeding paper. 

One may question whether the whole evolutionary process of protein-coding sequences can be ap- 
proximated by a reversible Markov process or not. Kinjo and Nishikawa [45] reported that the log-odds 
matrices constructed for 18 different levels of sequence identities from structure-based protein alignments 
have a characteristic dependence on time in the principal components of their eigenspectra. Although 
they did not explicitly mention, this type of temporal process peculiar to the log-odd matrix in protein 
evolution is fully encoded in the transition matrices of JTT, WAG, LG, and KHG. In Fig. ISlli it is 
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shown that this characteristic dependence of log-odds on time can be reproduced by the transition ma- 
trix based on the present reversible Markov model fitted to JTT; see Supporting Information, Text SI, 
for details. This fact supports the appropriateness of the present Markov model for codon substitutions. 
The present codon-based model can be used to generate log-odds for codon substitutions as well as amino 
acid substitutions. Such a log-odds matrix of codon substitutions would be useful to allow us to align nu- 
cleotide sequences at the codon level rather than the amino acid level, increasing the quality of sequence 
alignments. 

As a result, the present model would enable us to obtain more biologically meaningful information at 
both nucleotide and amino acid levels from codon sequences and even from protein sequences, because 
this is a codon-based model. 
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Figure 1. The ML-87 and the ML-91 models fitted to JTT. Each clement \og-0{{S){T,a))ab 
of the log-odds matrices of (A) the ML-87 and (B) the ML-91 models fitted to the 1-PAM JTT matrix 
is plotted against the log-odds log-0(S'''"'"^(l PAM))a& calculated from JTT. Plus, circle, and cross 
marks show the log-odds values for the types of substitutions requiring single, double and triple 
nucleotide changes, respectively. The dotted line in each figure shows the line of equal values between 
the ordinate and the abscissa. 
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Figure 2. The ML-200 model fitted to KHG. Each clement \og-0{{S){T,a))^^ of the log-odds 
matrix corresponding to (A) single, (B) double, and (C) triple nucleotide changes in the ML-200 model 
fitted to the 1-PAM KHG codon substitution matrix is plotted against the log-odds 
log-0(S'^"°(l PAM))^j. calculated from KHG. In (D), codon log-exchangeabilities of the 1-PAM KHG 
codon substitution matrix corresponding to triple nucleotide changes are plotted against the log-odds 
log-0(S'^H°(l PAM))^^. calculated from KHG. The log-exchangeability of the 1-PAM KHG is defined 
as (10/ log 10) log[i?™^ • ti.pAu/ fu]- Upper triangle, plus, circle, and cross marks show the log-odds 
values for synonymous pairs and one-, two-, and three-step amino acid pairs, respectively. 
Log-exchangeabilities for the codon pairs whose instantaneous rates are estimated to be in KHG are 
shown to be about —65 in this figure. The dotted line in each figure shows the line of equal values 
between the ordinate and the abscissa. 
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Figure 3. Selective constraint for each amino acid pair estimated from JTT and from 

KHG. The ML estimate, (A) _iuJTT-ml91+ ML-91+ model fitted to the 1-PAM JTT amino 

acid substitution matrix and (B) _u;KHG-ml200 ^^le ML-200 model fitted to the 1-PAM KHG codon 
substitution matrix, for each amino acid pair is plotted against the mean energy increment due to an 
amino acid substitution, (Aejj^ + Ae^^) defined by Eqs. Sl-4, Sl-5, and Sl-6 in Supporting Information, 
Text SI. In (A), the estimates Wab for the least exchangeable class of multi-step amino acid pairs are 
not shown. Plus, circle, and cross marks show the values for one-, two-, and three-step amino acid pairs, 
respectively. 
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Figure 4. The KHG-ML200-11 model fitted to each of JTT, WAG, LG, and mtREV. 

Each element \og-0{{S){T, d-))ab of the log-odds matrices of tlic KHG-ML200-11 model fitted to the 
1-PAM matrices of (A) JTT, (B) WAG, (C) LG, and (D) mtREV is plotted against the log-odds 
log-0(5'^^(l PAM))ab calculated from the corresponding empirical substitution matrices. Plus, circle, 
and cross marks show the log-odds values for one-, two-, and three-step amino acid pairs, respectively. 
The dotted line in each figure shows the line of equal values between the ordinate and the abscissa. The 
log-odds elements of mtREV whose values are smaller than about —47.8 are all assumed to be —47.8; 
see the original paper [6]. 
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Table 1. Brief description of models. 



Model name 


Description 


No-Constraints-n 


No amino acid dependences of selective constraints; /? = 0. The suffix n means 
the number of ML parameters. 


El-n 


^estimate ^ Ae^^ + Ae^jj based on the Energy-increment-based (EI) method, 
which is described in Supporting Information, Text SI, is used to estimate Wab 
in Eq. 1111 The suffix n means the number of ML parameters. 


Miyata-n 


The amino acid pair distance dab estimated by Miyata et al. |32| is used as 
= —dab to estimate Wab in Eq. [TT] The suffix n means the number of 
ML parameters. 


Grantham-n 


The amino acid distance dab estimated by Grantham [31] is used as = 
—dab to estimate Wab in Eq. 1111 The suffix n means the number of ML 
parameters. 


ML-n 


Selective constraints {wab} are estimated by maximizing the likelihood of JTT 
[5], WAG [10], or LG [n], and called {y;JTT/wAG/LG-MLn|_ ^j^^ ^^^^ ^ ^^^^^ 
the number of ML parameters. In the ML-87, multiple nucleotide changes are 
disallowed, and {wab} for all 75 single-step amino acid pairs are estimated. In 
the ML-91 and the ML-94, multiple nucleotide changes are allowed, and {wab} 
for all 75 single-step amino acid pairs and for 6 groups of multiple-step amino 
acid pairs are estimated. In the ML-91, equal codon usage is assumed. In the 
ML-200 for codon substitution matrices, {wab} for all 190 amino acid pairs are 
estimated. 


ML-n+ 


First, the ML-n is used to estimate parameters, and then {wab} for all multiple- 
step amino acid pairs are estimated by maximizing the likelihood with fixing 
all other parameters to the values estimated by the ML-n. 


JTT-ML91-n, 

WAG-ML91-n, 

LG-ML91-n 


Selective constraints j^^J^ ^/^^'^/^'^ ^^^^^j estimated by maximizing the like- 
lihood of JTT/WAG/LG [SKIOllIl] in the ML-91 model are used as {wlf"""^"} 
in Ec^. 111! The suffix n means the number of ML parameters. 


JTT-ML91+-n, 

WAG-ML91+-71, 

LG-ML91+-n 


Selective constraints {ui^fl i/wag/lg ml91-H| j,g|-jjj^g^|-j,(j ]-,y maximizing the like- 
lihood of JTT/WAG/LG [SlfTUUTT] in the ML-91-H model are used as {iy°|t™ate| 
in Eq. [TT] The suffix n means the number of ML parameters. The JTT/WAG/ 
LG-ML91-f-0 models correspond to the JTT/WAG/LG-F models, respectively. 


KHG-ML200-n 


Selective constraints {u;^^iiG-ML:iuu| estimated by maximizing the likelihood of 
the KHG codon substitution matrix [T3] in the ML-200 model are used as 
{^afc'""''*°} ^'i- EI "^^^ suffix n means the number of ML parameters. The 
KHG-ML200-0 models correspond to the KHG-F model. 



Table 2. AAIC values of the present models without and with the selective constraints 
on amino acids, which are based on mean energy increments due to an amino acid substitution (EI), 
the Miyata's and the Grantham's physico-chemical distances, for the 1-PAM amino acid substitution 
matrices of JTT, WAG, cpREV, and mtREV. 







AAIC ° 


Model 


^parameters 
(id no. 


JTT 


WAG 


cpREV 


mtREV 


No-Constraints- 












1 


21(/3 = 0, 3) 


86428.1 


37917.6 


3478.0 


2644.1 


10 


30(/3 = 0, 2-10,14) 


24595.6 


7719.1 


904.5 


901.0 


13 


33{/3 = 0, 2-14) 


22913.6 


7141.5 


874.9 


798.8 


EI- 












2 


22(1,3) 


77337.9 


35058.8 


3186.0 


2396.6 


2G 


22(1,14) 


24197.7 


5571.6 


974.0 


1066.8 


3 


23(1,3,14) 


16463.7 


4995.0 


761.5 


776.4 


4 


24(1-3,14) 


15808.7 


4443.6 


743.0 


753.9 


8 


28(1-7,14) 


15715.0 


4327.8 


722.0 


728.2 


7 


27(1-3,8-10,14) 


15081.0 


4312.6 


650.7 


688.7 


10 


30(1,3-10,14) 


15435.7 


4801.8 


670.7 


702.8 


lOM 


30(1-10) 


15270.7 


4250.4 


645.3 


674.3 


11 


31(1-10,14) 


14999.0 


4202.5 


636.0 


674.3 


lOMU 


30(1-3,8-14) 


13464.3 


3959.7 


578.9 


662.4 


12 


32(1,3-13) 


72316.3 


33908.4 


2939.7 


2215.0 


13 


33(1,3-14) 


13819.7 


4554.2 


623.6 


655.5 


13M 


33(1-13) 


13436.2 


3822.4 


551.1 


623.3 


14 


34(1-14) 


13151.9 


3748.0 


541.9 


614.8 


Miyata- 












4 


24(1-3,14) 


16090.1 


4938.1 


750.3 


783.0 


7 


27(1-3,8-10,14) 


15767.2 


4715.4 


654.5 


701.6 


10 


30(1,3-10,14) 


16446.1 


5124.9 


679.2 


708.5 


11 


31(1-10,14) 


15536.8 


4429.5 


628.4 


658.4 


13 


33(1,3-14) 


15058.2 


4943.1 


656.5 


682.3 


14 


34(1-14) 


14338.5 


4254.0 


603.7 


613.6 


Grantham- 












4 


24(1-3,14) 


20505.1 


5953.7 


916.4 


887.1 


7 


27(1-3,8-10,14) 


18898.2 


5814.0 


840.6 


832.9 


10 


30(1,3-10,14) 


18744.5 


5749.0 


805.4 


799.8 


11 


31(1-10,14) 


18680.9 


5579.7 


803.2 


796.5 


13 


33(1,3-14) 


16784.9 


5512.9 


765.0 


741.0 


14 


34(1-14) 


16729.7 


5477.1 


755.0 


739.5 



" AAIC = 2NiKL(0) + 2x #parameters with TV ~ 5919000 for JTT, N x 1637663 for WAG, N ^ 169269 for cpREV, and 
N 137637 for mtREV; see text for details. 

' ML parameters in each model are specified by the parameter id numbers in the parenthesis, and other parameters are 
fixed at ido = 0, idi = oo, id2 — > 0, ids-y = 1.0, idg-ia = 0.5, and idi4 — > 0. Each id number corresponds to the parameter 
id number listed in Table |3] 



Table 3. ML estimates and AAIC values of the present models for the 1-PAM amino 
acid substitution matrices of JTT, WAG, and LG, and the 1-PAM codon substitution 
matrix of KHG. 



id 
no. 



parameter 



JTT 



ML-87" ML-91" ML-94 



ML-S?" 



WAG 



ML-gi" 



ML-94 



LG 



ML-91° ML-94 



KHG 

(codon) 
ML-200 





1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 



-wo 

ITT-ltcllag] 

rriag/rhtc\ag 
mta/m[tc][ag] 

rhtg /rrntc][ag] 

rh.ca/m[tc]lag] 
f mut 

Jt + a 

smut I rmut 
St I /, 



t + a 

"mut / smut 
/Jc+g 



?usage 
h + a 

?usagc , i^usage 
Jt /Jt+a 

pusagc I rusage 
.'c / Jc+g 



N/A 
N/A 
H 0) 
0.0919 
1.77 
0.0293 
3.21 
0.719 
0.408 
0.113 
0.698 
0.0682 
0.461 
0.386 
27.3 



N/A 
N/A 
0.637 
1.57 
1.14 
0.729 
0.940 
1.19 
0.459 
0.501 
0.429 
(0.5) 
(0.5) 
(0.5) 
0.738 



N/A 
N/A 
0.662 
1.59 
1.15 
0.730 
0.950 
1.18 
0.446 
0.522 
0.436 
0.483 
0.491 
0.558 
0.740 



N/A 

N/A 
0) 
0.746 

1.98 
0.0477 

3.64 
0.110 
0.372 
0.234 
0.425 
0.0669 
0.330 
0.310 

43.3 



N/A 
N/A 
1.28 
1.70 
1.32 
0.791 
1.04 
1.23 
0.367 
0.587 
0.479 
(0.5) 
(0.5) 
(0.5) 
0.905 



N/A 
N/A 
1.29 
1.69 
1.31 
0.784 
1.01 
1.23 
0.392 
0.513 
0.471 
0.221 
0.429 
0.306 
0.840 



N/A 
N/A 
1.08 
1.85 
1.23 
0.676 
1.07 
1.28 
0.388 
0.450 
0.427 
(0.5) 
(0.5) 
(0.5) 
0.415 



N/A 
N/A 
1.19 
1.81 
1.21 
0.682 
1.07 
1.25 
0.403 
0.439 
0.383 
0.447 
0.555 
0.249 
0.395 



N/A 
N/A 
0.939 
0.843 
0.945 
1.52 
0.554 
0.573 
0.497 
0.513 
0.470 
NA 
NA 
NA 




#parameters 
iKLiO) X 10* 

AAIC " 



0.334 
107 
15695 
2072.0 



0.0243 
111 
638 
297.5 



0.0246 
114 
613 
300.6 



0.317 
107 
35319 
1370.8 



0.0223 
111 
1903 
284.3 



0.0207 
114 
1438 
275.1 



0.0246 
111 
2771 
782.5 



0.0240 
114 
2335 
700.4 



0.0240 
261 
269946 
unknown 



Ratio of substitution rates 
per codon 

the total base/codon 
transition / transvcrsion 
nonsynonymous /synonymous'^ 



1.28 
0.464 
1.13 



1.35 
1.08 
1.37 



1.35 
1.08 
1.34 



1.38 
0.482 
1.57 



1.53 
0.932 
2.07 



1.52 
0.806 
2.40 



1.38 
1.18 
1.05 



1.39 
1.20 
1.20 



1.29 
(1.29)^* 
0.764 
(0.765)'' 
0.726 
(0.723)'' 



Ratio of substitution rates 
per codon for cr — > 
total base/codon 
transition / transvcrsion 
nonsynonymous / synonymous*^ 



1.0 
0.101 
0.0644 



1.22 
1.21 
1.04 



1.22 
1.22 
1.02 



1.0 
0.647 
0.138 



1.38 
1.11 
1.50 



1.40 
0.932 
1.79 



1.31 
1.31 
0.853 



1.33 
1.35 
0.889 



1.29 
0.764 
0.726 



Ratio of substitution rates per 
codon for Wab = and ct — > 
total base/codon 
transition / transvcrsion 
nonsynonymous /synonymous*^ 



1.0 
0.0605 
11.3 



1.45 
0.829 
5.58 



1.46 
0.831 
5.74 



1.0 
0.499 
11.1 



1.72 
0.933 
8.68 



I. 74 
0.849 

II. 1 



1.67 
0.992 
7.45 



1.71 
0.981 
8.46 



1.51 
0.427 
6.81 



" If the value of a parameter is parenthesized, the parameter is not variable but fixed to the value specified. 

b iji^(e) = -(.«(0)/Af + 2.98607330) for JTT, -(£(0)/7V + 2.97444860) for WAG, -(^(0)/Af + 2.96853414) for LG, and 

-(^(0)/iV + 4.19073314) for KHG; see text for details. 

" AAIC = 2NiKL{0) + 2x #parameters with N ~ 5919000 for JTT, TV « 1637663 for WAG, N 10114373 for LG, and 
the value of A'^ is unknown for KHG; see text for details. 

The value in the parenthesis corresponds to the one for the KHG codon substitution probability matrix. 

Note that these ratios are not the ratios of the rates per site but per codon; see text for details. 



Table 4. Correlations of Wab between various estimates; the lower half shows the correlation 
coefficients of Wab for 75 single-step amino acid pairs and the upper half does those of Wab for 86 
multi-step amino acid pairs by excluding 29 amino acid pairs of the least exchangeable category in the 
JTT-ML91, the WAG-ML91 or the LG-ML91. 



Model 


EI 


JTT-ML91+ 


WAG-ML91+ 


LG-ML91+ 


KHG-ML200 


EI 




0.45 


0.51 


0.59 


0.55 (0.65)'' 


JTT-ML91-I- 


0.66 




0.80 


0.80 


0.51 


WAG-ML91+ 


0.68 


0.87 




0.86 


0.55 


LG-ML91+ 


0.71 


0.82 


0.90 




0.58 


KHG-ML200 


0.71 


0.77 


0.69 


0.74 





° The value in the parenthesis is the correlation coefficient for which the Wat for all multi-step amino 
acid pairs are taken into account. The correlation coefficient of Wat for all amino acid pairs between the 
EI and the KHG-ML200 is equal to 0.60. 



Table 5. AAIC values of the present models with the respective selective constraints on 
amino acids, u,Jtt-ml9i+^ ^wag-ml9i+^ ^lg-ml9i+^ ^khg-M2oo^ ^1^^ ^^^ous 1-PAM 

substitution matrices. 





^parameters 


AAIC 


IklW X 108 <^ 


Model name 


^parameters 
(id no. 


JTT WAG LG cpREV mtREV 


KHG 

(amino acid) 


KHG 

(codon) 


JTT-ML91+- 

1 
4 
7 
11 
12 


20 

21(14) 

24(1-3,14) 

27(1-3,8-10,14) 

31(1-10,14) 

32(0-10,14) 


2657.5 20807.0 461.7 426.0 
2065.1 20382.6 433.9 424.4 

1773.7 16148.3 439.2 401.9 

1257.8 12330.2 303.4 295.5 

1152.9 12140.0 291.5 286.5 


40931 


473668 


WAG-ML91+- 

1 
4 
7 
11 
12 


20 

21(14) 

24(1-3,14) 

27(1-3,8-10,14) 

31(1-10,14) 

32(0-10,14) 


9095.4 10537.3 316.2 535.1 
8928.9 9196.3 317.1 532.8 
6274.9 6354.9 281.4 414.0 
3658.3 5294.9 261.6 383.6 
3299.2 4813.3 259.1 365.1 


12789 


496804 


LG-ML91+- 

1 
4 
7 
11 
12 


20 

21(14) 

24(1-3,14) 

27(1-3,8-10,14) 

31(1-10,14) 

32(0-10,14) 


13669.8 1806.0 487.1 593.4 
12176.2 1188.8 421.4 558.0 
6325.7 811.6 340.6 391.6 
3983.0 636.0 267.0 329.8 
3878.5 574.7 267.1 314.9 


5732 


436557 


KHG-ML200- 

1 
4 
7 
11 


20 

21(14) 
24(1-3,14) 
27(1-3,8-10,14) 
31(1-10,14) 


15063.5 953.4 12568.9 403.6 593.6 

15078.6 955.4 12570.9 405.6 595.6 
6398.0 540.7 5683.3 297.4 399.3 
4611.5 533.4 3804.2 259.9 358.0 
4429.9 518.7 3006.1 251.7 334.1 







" Parameter id numbers in the parenthesis mean ML parameters in eaeh model and other parameters except for /3 = 1 and 
Wo = are fixed to the value of the corresponding parameter listed in the column of the ML-91 or the ML-200 in Table [S] 
each id number corresponds to the parameter id number listed in Table [S] 

AAIC = 2NiKL{0) + 2x #parameters with ~ 5919000 for JTT, f» 1637663 for WAG, N 10114373 for LG, 
N ^ 169269 for cpREV, and A^ ^ 137637 for mtREV; see text for details. 

" iKhiO) = -{f\0)/N + 2.97009788) for the KHG-derived amino acid substitution probability matrix, and -{e{e)/N + 
4.19073314) for the KHG codon substitution probability matrix; see text for details. 



Table 6. ML estimates of the present models with the respective selective constraints 
for the 1-PAM amino acid substitution matrices of JTT, WAG, and LG. 





JTT 


WAG 


LG 




WAG- " 


LG- " 


KHG- " 


JTT- " 


LG- " 


KHG- " 


JTT- " 


WAG- " 


KHG- " 




ML9H-11 


ML200-11 


ML9H-11 


ML200-11 


ML91 


+-11 


ML200-11 


—wo 


(0.0) 


(0.0) 


(0.0) 


(0.0) 


(0.0) 


(0.0) 


(0.0) 


(0.0) 


(0.0) 




1.08 


1.32 


1.07 


1.04 


1.28 


1.01 


0.830 


0.798 


0.757 




0.429 


0.304 


0.257 


1.29 


0.921 


0.648 


1.45 


1.543 


0.577 


tntclag /'rh[tc][ag] 


2.36 


2.42 


1.26 


1.19 


1.71 


0.850 


1.16 


1.82 


0.783 


rhag/rhtclag 


1.22 


1.16 


0.915 


1.26 


1.27 


1.00 


1.20 


1.26 


0.869 


rhta/rhitciiag] 


0.649 


0.654 


1.32 


0.814 


0.802 


1.54 


0.668 


0.634 


1.59 


rhtg/rh^tdWag] 


1.13 


1.01 


0.622 


0.862 


0.947 


0.568 


0.988 


1.20 


0.524 


'maa/rh\tc]\ag] 


1.18 


1.31 


0.605 


1.27 


1.33 


0.597 


1.24 


1.20 


0.446 


rmut 
Jt + a 


0.481 


0.507 


0.578 


0.351 


0.405 


0.512 


0.333 


0.335 


0.534 


Pmut / ?mut 
Jt 1 Jt + a 


0.527 


0.488 


0.490 


0.548 


0.527 


0.519 


0.462 


0.518 


0.463 


Pmut / ?inut 
Jc 1 Jc+g 


0.429 


0.390 


0.413 


0.461 


0.435 


0.463 


0.455 


0.468 


0.446 


a 


1.09 


1.28 


0.604 


0.893 


0.751 


^ 


0.886 


0.718 


^ 


TO 


0.0263 


0.0310 


0.0363 


0.0220 


0.0230 


0.0275 


0.0246 


0.0231 


0.0444 


^parameters 


31 


31 


31 


31 


31 


31 


31 


31 


31 


7ifL(0) X 10** ' 


27346 


32239 


36897 


33306 


15653 


13945 


59707 


23488 


14554 


AAIC 


3299.2 


3878.5 


4429.9 


1152.9 


574.7 


518.7 


12140.0 


4813.3 


3006.1 


Ratio of substitution 




















rates per codon 




















the total base/codon 


1.35 


1.32 


1.19 


1.51 


1.45 


1.19 


1.47 


1.49 


1.12 


transition /transversion 


1.23 


1.25 


1.02 


0.815 


0.959 


0.753 


0.902 


1.08 


0.789 


non- / synonymous'' 


1.49 


1.17 


0.612 


2.07 


1.59 


0.577 


1.56 


1.60 


0.293 


For cr ^« 




















the total base/codon 


1.19 


1.13 


1.09 


1.37 


1.33 


1.19 


1.34 


1.39 


1.12 


transition/transversion 


1.51 


1.57 


1.06 


0.923 


1.10 


0.753 


1.03 


1.29 


0.789 


non- /synonymous'* 


1.03 


0.755 


0.449 


1.54 


1.19 


0.577 


1.14 


1.20 


0.293 


For Wab ~ and a — s> 




















the total base/codon 


1.38 


1.29 


1.18 


1.66 


1.60 


1.38 


1.68 


1.80 


1.34 


transition /transversion 


1.27 


1.28 


0.642 


0.645 


0.926 


0.440 


0.622 


0.989 


0.390 


non- /synonymous'* 


4.67 


3.99 


3.71 


8.62 


7.02 


5.35 


8.79 


9.49 


5.23 



" In all models, equal codon usage (/J^"^^^" = fa^^^'^ = /"^^*^° = /"^^^° = 0.25) is assumed. If the value of a parameter is 
parenthesized, the parameter is not variable but fixed to the value specified. 

'' IklW = + 2.98607330) for JTT, -{l{e)/N + 2.97444860) for WAG, and -{e{0)/N + 2.96853414) for LG. 

" AAIC = 2NiKL{0) -I- 2x #parameters with N ~ 5919000 for JTT, N ^ 1637663 for WAG, and N ^ 10114373 for LG; 
see text for details. 

'' Note that these ratios arc not the ratios of the rates per site but per codon; see text for details. 



Table 7. ML estimates of the present models with the respective selective constraints 
for the 1-PAM amino acid substitution matrices of cpREV and mtREV. 





cpREV 


mtREV 




JTT- " 


WAG- " 


LG- " 


KHG- " 


JTT- " 


WAG- " 


LG- " 


KHG- " 




ML91+-11 


ML200-11 


ML91+-11 


ML200-11 


— Wo 


(0.0) 


(0.0) 


(0.0) 


(0 0) 


(0.0) 


(0.0) 


(0.0) 


(0 0) 






n Q77 


1 1 S 

J — LO 


1.02 


n fiqn 


U.O^U 


n Q77 
u. y ( ( 


0.752 


77?, 1 r 1 


u.ouu 


n Ql 7 


n Ri 1 

U.Ul 1 


0.521 


u.tjuy: 


n ^94 




0.228 


TTlfrlnn / fTilfr] \n nl 
tc,| ay / [Lcj [ciyj 


1.50 


2.23 


2.353 


1.14 


2.01 


3.43 


3.82 


1.64 


I'^agi ' "'tc.\(ig 


1.28 


1.30 


1.24 


0.973 


1.06 


1.13 


1.08 


0.752 




0.746 


0.705 


0.733 


1.61 


0.681 


0.595 


0.638 


2.00 


Tflfn 1 TTX\frA \nn} 


1.17 


1.37 


1.25 


0.747 


0.792 


0.893 


0.839 


0.411 


Tflrn / iril+f.] \n n} 


1.23 


1.17 


1.26 


0.566 


1.65 


1.67 


1.76 


0.623 


jmut 


u.zoo 


u.ouu 




0.442 


n 9fi9 


n 97n 

U.Z f u 


n 9S7 


0.426 


?mut / ?mut 
It 1 Jt+a 


n fii 1 


U.UU4: 


u.uuy 


0.597 


U.UU -L 


fl fi^9 
u.uu^ 




0.631 


?mut / ?mut 

Jc 1 Jc+g 


0.425 


0.446 


0.393 


U.4Z0 


0.349 


0.304 


0.260 


U.ooz 


(T 


i.yo 


1 /I Q 


1.(0 


n 1 '\R 

U. -LUO 


Q /I Q 


Z.lo 


Q Q7 
O.O / 


9 SQ 




U.Uozo 


U.U280 


n noon 
U.U339 


U.Uzoo 


n AC no 
U.UuUo 


U.U440 


U.U6o3 


u.uy/o 


,. 

^pcH'clIIlGtGrS 


O-L 


o ± 


O-L 








O-L 


01 


J-KL\y) X lU 


67803 


58229 


60586 


ODUo/ 


81541 


110126 


91860 


nQQQ'7 


A A ^^ ^ 


291.5 


259.1 


267.1 




286.5 


365.1 


314.9 


004:. J- 


J-LCILjUJ \J\. O Ll Ui3 Li It U. LiUli 


















rates per codon 


















the total base/codon 


1.45 


1.46 


1.41 


1.20 


1.36 


1.37 


1.33 


1.23 


transition/transversion 


1.05 


1.20 


1.25 


1.05 


1.44 


1.65 


1.74 


1.45 


non- /synonymous'' 


1.74 


1.80 


1.38 


0.631 


0.908 


1.04 


0.772 


0.403 


For o- 


















the total base/codon 


1.21 


1.26 


1.20 


1.16 


1.11 


1.15 


1.09 


1.05 


transition /transversion 


1.42 


1.66 


1.77 


1.07 


2.52 


2.73 


3.31 


1.96 


non- /synonymous'* 


1.03 


1.10 


0.794 


0.573 


0.387 


0.515 


0.312 


0.163 


For Wah ~ and cr — 5- 


















the total base/codon 


1.45 


1.55 


1.44 


1.33 


1.31 


1.37 


1.26 


1.16 


transition /transversion 


0.797 


1.20 


1.25 


0.569 


1.06 


1.78 


1.98 


0.883 


non- /synonymous'' 


6.06 


6.33 


5.14 


4.97 


3.40 


3.09 


2.58 


3.02 



" In all models, equal codon usage = fa^^^" = /"^^^° = /"^^^° = 0.25) is assumed. If the value of a parameter is 

parenthesized, the parameter is not variable but fixed to the value specified. 

Ikl{0) = -{i{0)/N + 2.95801048) for cpREV, and -{e{e)/N + 2.85313622) for mtREV; see text for details. 
'= AAIC = 2NiKL{0) -I- 2x #parameters with N ^ 169269 for cpREV, and N ^ 137637 for mtREV; see text for details. 

Note that these ratios are not the ratios of the rates per site but per codon; see text for details. 



Table 8. ML estimates of the present models with the respective selective constraints for 
the 1-PAM KHG-derived amino acid and KHG codon substitution matrices. 





KHG (amino acid) 


KHG (codon) 




JTT- " 


WAG- " 


LG- " 


JTT- " 


WAG- " 


LG- " 




ML91-f-ll 


ML9H-12 


—wo 


(0.0) 


(0.0) 


(0.0) 


1.29 


1.50 


1.11 




0.952 


0.912 


1.22 


1.72 


2.02 


1.91 


m[tc][a9l 


1.545 


1.68 


1.33 


1.23 


1.21 


1.15 


rhtc\ag /'rhltc]lag] 


1.19 


1.73 


1.69 


0.992 


1.07 


1.09 




1.24 


1.28 


1.22 


1.09 


1.12 


1.10 


rnta/rhitc][ag] 


0.689 


0.682 


0.748 


1.26 


1.25 


1.25 


rhtg/rhytc] [ag] 


0.855 


1.07 


0.943 


0.646 


0.662 


0.671 


'Tlca/nT.[tc][ag] 


1.32 


1.26 


1.31 


0.815 


0.806 


0.813 


?mut 
Jt + a 


0.317 


0.334 


0.377 


0.480 


0.484 


0.488 


?mut / ?mut 
Jt 1 Jt + a 


0.533 


0.579 


0.512 


0.499 


0.499 


0.493 


Pmut / ?mut 
Jc / Jc+g 


0.460 


0.480 


0.441 


0.464 


0.459 


0.459 


a 


2.64 


2.25 


1.30 


^ 


0.0496 


-+ 


TO 


0.0308 


0.0286 


0.0247 


0.0240 


0.0247 


0.0240 


^parameters 


31 


31 


31 


32 


32 


32 


/a'l(0) X 10** " 


40931 


12789 


5732 


473668 


496804 


436557 


Ratio of substitution 














rates per codon 














the total base/codon 


1.64 


1.66 


1.59 


1.29 


1.29 


1.29 


transition/transversion 


0.772 


0.859 


0.891 


0.759 


0.765 


0.767 


non- /synonymous'^ 


2.56 


2.61 


2.03 


0.728 


0.727 


0.724 


For cr — s> 














the total base/codon 


1.39 


1.45 


1.43 


1.29 


1.28 


1.29 


transition /transversion 


0.977 


1.15 


1.08 


0.759 


0.770 


0.767 


non- /synonymous'^ 


1.48 


1.54 


1.36 


0.728 


0.704 


0.724 


For Wab ~ and a — s- 














the total base/codon 


1.71 


1.83 


1.75 


1.65 


1.65 


1.64 


transition/transversion 


0.637 


0.926 


0.892 


0.51 


0.552 


0.561 


non- /synonymous'^ 


9.41 


10.3 


8.86 


8.16 


8.07 


7.77 



" In all models, codon frequencies are taken to be equal to the observed ones. If the value of a parameter is parenthesized, 
the parameter is not variable but fixed to the value specified. 

IklW = ~{i{0)/N + 2.97009788) for the KHG-derived amino acid substitution probabiUty matrix, and -(1(0) /N + 
4.19073314) for the KHG codon substitution probability matrix; see text for details. 

Note that these ratios are not the ratios of the rates per site but per codon; see text for details. 



Table 9. Log-likelihoods of a phylogenetic tree [6] of the concatenated sequences of 12 
protein-coding sequences encoded on the same strand of mitochondrial DNA from 20 vertebrate species 
with 2 races from human. 



Codon Substitution 
ModeP 




e+ 

116898.6 


AIC- 
233917.3 


(7 


m[tc][ag] 


mtc\ag/m[tc][ag] 


LG-l-F'^ 


60 


-1293.8 


2587.6 








KHGaa-l-F'^'' 


60 


-1293.0 


2586.1 








WAG-l-F'^ 


60 


-1108.1 


2216.1 








JTT-1-F'= 


60 


-836.4 


1672.8 








mtREV-l-F^ 


60 


0.0 


0.0 








No-Constraints- l-F'^ 


60 


-1731.0 


3462.1 


(2.46) 


(0.040) 


(3.24) 


WAG-ML91+-1-F^ 


60 


1021.4 


-2042.7 


(2.18) 


(0.524) 


(3.43) 


JTT-ML91+-1-F'= 


60 


1237.7 


-2475.5 


(3.48) 


(0.564) 


(2.01) 


LG-ML91+-1-F^ 


60 


1382.2 


-2764.4 


(3 37) 


fO 321) 


(3.82) 


EI-l-F'^ 


60 


1395.8 


-2791.6 


(0.339) 


(0.737) 


(3.06) 


KHG-ML200-1-F^ 


60 


1676.9 


-3353.9 


(2.89) 


(0.228) 


(1.64) 


No-Constraints- 11-F 


70 


772.2 


-1524.4 


0.906 


0.273 


3.37 


EI-12-F 


71 


1966.6 


-3911.2 


0.326 


0.549 


3.60 


WAG-ML91+-12-F 


71 


2268.3 


-4514.5 


1.84 


0.471 


4.16 


JTT-ML91-I-12-F 


71 


2275.1 


-4528.1 


3.57 


0.506 


2.91 


KHG-ML200-12-F 


71 


2355.7 


-4689.4 


0.469 


0.226 


2.50 


LG-ML91-^-12-F 


71 


2510.0 


-4997.9 


1.26 


0.357 


4.32 


No-Constraints- ll-F-dG4 


71 


2495.4 


-4968.9 


0.000 


0.182 


3.62 


EI-12-F-dG4 


72 


3742.4 


-7460.7 


0.000 


0.392 


3.95 


JTT-ML91+-12-F-dG4 


72 


4156.9 


-8289.8 


0.064 


0.385 


3.11 


KHG-ML200-12-F-dG4 


72 


4190.0 


-8356.0 


0.000 


0.147 


2.60 


WAG-ML91+-12-F-dG4 


72 


4196.4 


-8368.7 


0.042 


0.342 


4.61 


LG-ML91-h-12-F-dG4 


72 


4412.6 


-8801.1 


0.029 


0.253 


4.83 



° In all models named with a suffix "F", codon frequencies are taken to be equal to those in coding 
sequences. A suffix "dG4" means the discrete approximation of the F distribution with 4 categories [Hj 
for rate variation. The parameter wq in Eq. [Tl]is optimized in all models. 

^ The number of parameters; the value for the mtREV-l-F is not quite correct, because mtREV was 
estimated from the almost same set of protein sequences [5]. 

The exchangeabilties of nonsynonymous and synonymous codon pairs are equal to exp wq multiplied by 
those of the corresponding amino acid pairs and all equal to the mean amino acid exchangeability in the 
empirical amino acid substitution matrix specified, respectively. 

KHGaa means the amino acid substitution matrix derived from KHG. 

All parameters except wq and codon frequencies are fixed to those ML estimates of each model fitted 
to mtREV. 



Supporting Information Legends 



Text SI. Supporting information consisting of the following sections. 

1. A method for the physico-chemical evaluation of selective constraints on amino acid replace- 
ment. 

2. Models with no amino acid dependences of selective constraints. 

3. A physico-chemical evaluation of selective constraints on amino acids. 

4. Other physico-chemical evaluations of selective constraints on amino acids. 

5. Evolutionary process of amino acid substitutions in terms of log-odds. 

Data SI. A computer-readable dataset of the ML estimates of parameters in the ML-200 for KHG, and 
the ML-91 and the ML-91+ for LG, WAG, and JTT as well as the EI. 
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Supplementary Methods 



A method for the physico-chemical evaluation of selective constraints on amino 
acid replacements 

Physico-chemical evaluations of {wab} are not meaningless, even though selective constraints {wat} in Eq. 
[9] in the text can be optimized for observed data. Their performance in reproducing observed substitution 
data indicates how extensively selective constraints on amino acid substitutions can be explained by 
physico-chemical requirements on amino acid substitutions to preserve protein structures and functions. 
In this section, a new physico-chemical method for the evaluation of the selective constraints is introduced. 

The rate of acceptance in amino acid replacements is assumed here to be proportional to the mean 
relative stability of the native conformation C of the mutant type of sequence S' to that of the wild 
type of sequence S. The probability P{C\S) of a conformation C that a sequence S takes is equal to 
the Boltzmann factor of C divided by the conformational partition function of S. The conformational 
partition function of a protein may be crudely approximated in the high temperature expansion. 

log P{C\S) 

-[l0g( J2 l)-^(^(C,'5))compact,T^oo] (Sl-1) 

C€:{compact} 

where k is the Boltzmann constant, T is temperature, and £{C,S) is the conformational free energy of 
the conformation C taken by the sequence S. First, the sum over conformations C are approximated 
by the sum over compact/nativelike conformations whose energies are significantly lower than those of 
extended conformations. Then, the logarithm of the partition function is approximated by the sum of the 
first and the second terms in the high temperature expansion. Thus, the relative stability of the native 
conformation of sequence S' to that of sequence iS is estimated by 

\og[P{C\S')/P{C\S)] 

^ -±,i£{C,S')-£iC,S)) , 

if the amino acid composition does not change. (Sl-2) 

The mean energy of compact conformations does not depend on the details of the amino acid order in 
protein sequences but primarily on the amino acid composition. Therefore, if the amino acid composition 
keeps constant during amino acid substitutions, as indicated by the present assumption of the stationary 
state for amino acid substitutions, the relative stability can be approximated by the difference of the 
native conformational energies of the two sequences. 

As a result, the parameter Wab, whose exponent is the acceptance rate of substitutions between amino 
acids of type a and type b, is evaluated here to be proportional to the mean free energy increment caused 
by a substitution between amino acids of type a and type b. Then, the mean free energy increment is 
approximated by the sum of two terms one of which results from the increment of contact energy between 
amino acids in a protein structure and the other from the change of side-chain volume. 

Wab = -l3[Ae:b + ^el,]+wo{l~Sab) (Sl-3) 

where /3 is a parameter, A.i'^f^ is the mean increment of contact energy between amino acids due to an 
amino acid exchange between amino acids of type a and type 6 in a protein structure, and Ai^b is the 



mean increment of free energy caused by the change of side-chain volume between amino acids of type 
a and type b. The exponent of the constant term, e"'", may represent the ratio of replaceable amino 
acid sites in a protein sequence, and then the first term represents the ratio of neutral substitutions at 
such mutable sites; the ratio of nonsynonymous to synonymous mutations is primarily determined by wo- 
However, wo may be positive, meaning positive selection. 



Mean energy increment for each type of amino acid substitutions 

To consider the mean contact energy increment due to an amino acid replacement between amino 
acids of type a and type 6, we must note that the evolutionary process of amino acid substitutions in 
proteins is assumed here to be in the stationary process, which means that the amino acid composition 
of proteins must be kept constant in the whole process of amino acid substitutions. To keep the amino 
acid composition constant, an exchange of amino acids in a protein may be considered as the process of 
substitutions. A mean contact energy increment, 2Ae^jj, due to an exchange between amino acids of type 
a and type b in a protein can be estimated |17j by averaging the difference of interaction energies over 
surrounding residues as 

Ail, = Ail, = J2(^''c-e,,)i^-^)>0 (Sl-4) 

where Caci— ^ca) is the contact energy between amino acids of type a and c, and Nad— ^ca) is a half 
of the observed number of contacts between amino acids of type a and type c, and is the number 
of amino acids of type a in protein structures. The contact energies Cat and the number of contacts 
Nab are the ones evaluated from the numbers of contacts between amino acids observed in representative 
protein structures |46| . The mean energy increment due to an amino acid exchange is non- negative 
for any pair of amino acids |17| . because the contact energies are derived by assuming that the native 
conformations of proteins are at the minimum of the total contact energy. This means that no favorable 
substitutions occur in protein evolution in which amino acid substitutions are in the stationary state. 
Thus, the assumption of the stationary state for amino acid substitutions is consistent [46] with the 
neutral theory |47| of molecular evolution. 

A contact potential used is a statistical estimate [46] of contact energies with a correction [48] for 
the Bethe approximation [JHUSO]- The contact energy between amino acids of type a and type b was 
estimated as 

I ^TA^Bcthc I A ^Bcthc , r^Bcthcl /oi r-\ 

Bab = err + a[Aear + Ae^f, H -deab (Sl-5j 

a' 

Crr is part of contact energies irrespective of residue types and is called a collapse energy, which is 
essential for a protein to fold by cancelling out the large conformational entropy of extended conformations 
but cannot be estimated explicitly from contact frequencies between amino acids in protein structures. 
Ae|^°"^'^ and ^ej^^"^'^ are the values of Acar and Seab evaluated by the Bethe approximation from the 
observed numbers of contacts between amino acids. Acar + e^r is a partition energy or hydrophobic 
energy for a residue of type a. Scab is an intrinsic contact energy for a contact between residues of type 
a and type 6; refer to |48 1 l50 j for their exact definitions. The proportional constants for correction were 
estimated as (3' /a' = 2.2 and a' < 1 [15] • Here, energy is measured in fcT units. The scaling constant /? 
in Eq. ISl-3l in the text is given for a' = 1. 

The energy increment Ae^j,, which results from a replacement between amino acids of different sizes, 
is assumed here to be proportional to the volume difference between amino acids of the type a and type 
b: 

^ilb = vi J^'^fJ'^ Wa^Vbl (Sl-6) 

l^a,b\^a ~ Vb\ 



where Va is the volume of amino acid a, and u is a proportional constant. The value of v is taken to 
be equal to one, otherwise specified; that is, the contact energy increment and the volume change are 
assumed to contribute to the total free energy increment and the acceptance rate with an equal weight. 
The amino acid volumes used here are the mean volume occupied by each type of amino acid in protein 
structures, and taken from the set named BL+ in Table 6 of Tsai et al. [51]; the volume of a half cystine 
(labeled as "cys" in the table) is used here for a cysteine. 

The values of [Ae^j, + Ae^j,] for all amino acid pairs are provided in Supporting Information, Data SI. 



Supplementary Results 



Models with no amino acid dependences of selective constraints 

Before examining the effects of selective constraints {wab) on likelihood, ML values for the models with 
no amino acid dependences of selective constraints, i.e., = in Eq. [Tl] were calculated for JTT, WAG, 
cpREV, and mtREV. The AAIC value and the ML estimators of m^,,, /|""*, and a for each 

model are listed in Table [5] and Table ISll respectively. Please note that wq is fixed here to 0, and so 
there is completely no selection pressure on nonsynonymous replacements; the likelihoods of amino acid 
substitution matrices do not strongly depend on wq and codon substitution data are required to reliably 
estimate the value of Wq. ML parameters in each model are specified by the parameter id numbers written 
in the parenthesis in the second column; each id number corresponds to the parameter id number listed 
in Table [3] Each model is called the No- Constraints model with a suffix meaning the number of ML 
parameters; see Table[T] Although No-Constraints models corresponding to the Kimura's two-parameter 
model [1] , the model of Hasegawa et al. [2] , the Tamura-Nei model [3j and the general reversible model 
[52j were examined, only three models for each matrix are shown in Table [2j 

The bias toward transition has been often pointed out I53j . In the present results for the No- 
Constraints models, t the ratio of transition to transversion exchangeability mtc\ag / ''TT-[tc][ag] is evaluated 
to be between 1.5 and 3.3 for all four matrices of JTT, WAG, cpREV, and mtREV, although that for 
mtREV is larger than those for the others. For the No-Constraints-1 of mtREV, its parameter is evaluated 
to be rhtc\ag / ''TT'ltc][ag] = 2.32 and the ratio of the total transition to the total transversion rate is equal 
to 1.24. This estimate of transition to transversion exchangeability bias for mitochondrial proteins is 
significantly smaller than the previous estimate by a maximum likelihood method for phylogcny. Yang 
et al. [7] estimated rhtc\ag/'>^[tc]lag] = 9.157 for the model corresponding to the No-Constraints-1 in the 
analyses of the most likely phylogeny of mitochondrial DNA encoding proteins. 

Although the significance of each parameter is indicated by the AIC values of the No-Constraints 
models with the various sets of parameters, its discussion is postponed until the next section where 
results for models with selective constraints are presented, because no selective constraints on amino 
acids is a completely wrong assumption. 



A physico-chemical evaluation of selective constraints on amino acids 

Let us examine how the likelihood of JTT is improved by using the present formula for selective con- 
straints. Eq. [TlJ The first evaluation of selective constraints on amino acids is based on the mean energy 
increments due to an amino acid replacement that result from the changes of pairwise contact energies 
[TTlllHlllHllin] and the volume change [5T] of an amino acid side chain by an amino acid replacement. 
This model in which selective constraints on amino acids are evaluated from mean energy increments 
due to an amino acid replacement is called here an Energy-increment-based (EI) model with a suffix 
meaning the number of ML parameters; see Table [TJ The ML values for the EI models with various sets 
of parameters are listed in Table[2l and the ML estimates for the EI-10 and the EI-11 are listed in Table 

m 

The No-Constraints-1, the No-Constraints-10, and the No- Constraints- 13 models correspond to a 
special case of /3 = in the EI-2, the EI-11. and the EI-14 models, respectively. As a matter of course, the 



selective constraints on amino acids that represent conservative selection against amino acid substitutions 
significantly improve the AAIC values for all substitution matrices. 

The significance of multiple nucleotide changes in a codon is indicated by the improvements of the 
AAIC between the EI-3 and the EI-4, between the EI-12 and the EI-13M, between the EI-10 and the 
EI-11, and between the EI-13 and the EI-14 models, in the latter of which the parameter m[tc][ag] for 
multiple nucleotide changes is optimized as a free variable. Also, the AAIC is improved by the inclusion 
of the scale parameter ct; compare the AAIC values between the EI-2 and the EI-3, between the EI-IOM 
and the EI-11, between the EI-12 and the EI-13, and between the EI-13M and the EI-14. Thus, taking 
account of both multiple nucleotide changes in a codon and variations in substitution rates is essential 
to obtain the reasonably large ML values. 

The most effective one of the remaining parameters on likelihood is the parameter for transition- 
transversion bias, mtc\ag /''Tiltc][ag]- The next effective parameters are /™"' and f^^^^°, and finally the 
remaining rate parameters. The AAIC values of the models EI-2G, EI-3, EI-7, EI-11, EI-IOMU, and EI- 
14 indicate that all parameters are effective to significantly improve the likelihood of each of the observed 
matrices. The ML estimates of the parameters /™"* and f^^^s° show the similar tendencies between 
the models, although this tendency differs among the substitution matrices, JTT, WAG, cpREV, and 
mtREV. The comparison of the AAIC values between the EI-IOMU and the EI-14 models indicates that 
the parameters for exchangeabilities except for transition-transversion bias, are statistically significant 
but are not so effective as /™"* and y^^^se ^^iq improvement of the likelihood. 

The relative weight v of the effects of volume change due to an amino acid replacement on selective 
constraints in Eq. Sl- ISl-61 is assumed to be equal to one but may be varied. Optimizing w as a free 
variable can improve the value of AAIC from 13151.9 to 12932.1 for JTT. This model may be justified 
because the effects of volume change due to an amino acid replacement on protein structures may be 
different among the types of protein structures, i.e., between membrane and soluble proteins, and between 
a and /3 proteins. 

Table [2] shows that the parameters {f^^^^°} for codon usage are significant to improve likelihood, 
however, the ML estimator of f^^^so q^^qj^ takes extremely small or large values. Thus, it may be better 
to assume equal codon usage by fixing f^^^^o _ q 25 jf codon frequencies are unknown. In the following, 
equal codon usage is assumed in most cases of unknown codon frequencies. 



Other physico-chemical evaluations of selective constraints on amino acids 

Grantham |31j and Miyata et al. |32| introduced physico-chemical distances between amino acids in 
attempts to model selective restraints against amino acid substitutions. Their physico-chemical distances 
were also used by Goldman and Yang [18] and Yang et al. [7], in which the acceptance ratio (expwab) 
was represented by using a linear formula of Miyata et al. [32] (expwaf, = a(l — /Sdab)) or a geometric 
formula (expwaf, = a exp(— /?dab)) of physico-chemical distance dab between amino acids of type a and b; 
where a and /? are parameters. In their models, stepwise substitutions through single nucleotide changes 
were assumed, and codon substitutions due to multiple nucleotide changes were completely neglected; in 
other words, ?TT-[ic][ag] with mj,,/m[tc][ag] = constant in Eq. [T]was assumed. Yang et al. [7] reported 
that the use of the Miyata's distance [32] for the acceptance ratio in their codon-based model lead to 
a better fit to the small data of mitochondrial protein sequences than the JTT-F and the mtREV24-F 
models, in which the rate matrix of JTT or mtREV24 with an adjustment for the equilibrium frequencies 
of amino acids is used; their codon-based models correspond to the present model with TO[tc][ag] 0, 
rritc = mag-, and mta = mtg — nica = meg, i.e., the two parameter model for nucleotide mutations with 
the adjustment for amino acid frequencies. 

Table [2] and Table [83] list the ML values and the ML estimates for JTT and WAG in the present 



models in which cither the Grantham's distance or the Miyata's distance {dab for an amino acid pair a 
and b) is used as yjcstimatc _ —dab to evaluate the selective constraints Wab in Eq. [TTJ 

Wab = ~f3dab + Wo{l - Sab) (Sl-7) 

where wq is always fixed to the value 0, because the likelihoods of amino acid substitution matrices do not 
significantly depend on wq- These models are called here Grantham and Miyata with a suffix meaning 
the number of ML parameters; see Table [1] Both the selective constraints based on the Grantham's and 
on the Miyata's distances significantly improve the AAIG. 

Miyata et al. |32j claimed that their new scale can explain the tendencies of amino acid replacements 
better than the Grantham's distance scale. Table [3] shows that the Miyata's physico-chemical distance 
performs better in all parameter sets than the Grantham's distance. This result is consistent with that of 
Yang et al. [7] for mitochondrial proteins. The present physico-chemical evaluation of selective constraints 
(EI model) fits JTT and WAG even better than the Miyata's distance scale, although the performances 
of both the methods are almost same for cpREV and mtREV. 

One of the important facts in these results is that allowing multiple nucleotide changes in a codon 
significantly improve the AIC irrespective of the estimations of selective constraints; compare the A AIC 
values between the Grantham-10 and the Grantham-11, and between the Miyata-10 and the Miyata-11. 
In other words, the improvement of the AIC value is not an artifact due to the present physico-chemical 
estimation of selective constraints. 



Evolutionary process of amino acid substitutions in terms of log-odds 

Kinjo and Nishikawa [55] reported that the most principal component of log-odds matrices exhibits a 
sharp transition at the sequence identity of 30-35%, which almost coincides with the twilight zone in 
homology search. This interesting feature of log-odds matrices was found by analyzing the eigenspectra 
of the log-odds matrices for 18 different levels of sequence identities, which were constructed from the 
structure-based alignments of protein sequences in the Homstrad database [54] with the procedure of 
the BLOSUM substitution matrices [55]. Although they did not mention, this feature is also encoded in 
an amino acid or codon substitution probability matrix for a short time interval such as JTT, WAG, LG, 
and KHG. Here, we show that this feature is encoded in the transition matrix estimated by the ML-91-1- 
model that precisely reproduces JTT. 

Fig. ISlll A shows the first, the second and the third principal eigenvalues of the log-odds matrix 
{log-0{{S){t))ab) of the ML-91-1- are drawn on amino acid identity by solid, broken and dotted lines, 
respectively. The dependences of these eigenvalues on the amino acid identity are almost exactly the 
same as those shown in the Fig. lA of their paper |45j : i.e., the first principal eigenvalue changes its sign 
from negative to positive at about 35 % identity, and the second principal eigenvalue takes the place of 
a negative eigenvalue by changing its sign from positive to negative. A similar event of exchanging the 
second and the third principal eigenvalues in the order occurs between 15 and 20 % identity in their case 
and at about 25 % identity in the present JTT-ML91-I- matrix; note that the value of sequence identity 
a; % on the abscissa in their Fig. lA [45] represents a log-odds matrix compiled from alignments with 
sequence identity > x % and < (x -I- 10) %. 

From Fig. ISllt V. one infers that the vector corresponding to the first principal eigenvector at about 
80 % identity becomes the second principal eigenvector at about 35 % identity and the third principal 
eigenvector at about 25 % identity. Likewise one infers that the vector corresponding to the second 
principal eigenvector at about 80 % identity becomes the first principal eigenvector below about 35 % 
identity, and the vector being equal to the third principal eigenvector at about 80 % identity becomes 
the second principal eigenvector below 25 % identity. This inference is exactly correct, as shown in Figs. 



[STTB.ISTTK. andEnt) and in Fig. IB of Kinjo and Nishikawa gg. In Figs. [STTB.ISTTlC. andEIIP, the 
inner product V^it) ■ yj'^'^(20PAM) of the ith principal eigenvectors y,(t) of the JTT-ML91+ log-odds 
matrix at time t and the jth principal eigenvectors Vj'^"'"(20PAM) of the JTT log-odds matrix at 20 
PAM is plotted against sequence identity at time t. Fig. ISllI indicates that the eigenvalues change but 
the eigenvectors remain almost the same until sequence identity attains about 20 %. The sharp exchange 
between the first and the second principal eigenvalues is not peculiar to the present substitution matrices 
but can occur in any transition matrix in which diagonal elements differ from each other; transition 
matrices generated with Rab = const ■ /f, have such a characteristic feature. A critical point is what the 
principal eigenvectors are as well as those eigenvalues. 

The first principal eigenvalues of the log-odds matrices are large negative in t > 40 % identity, 
contributing negative values to the diagonal elements of the log-odds matrices. Thus, the first principal 
eigenvector with a large negative eigenvalue is a primary contribution to the mutability of each amino 
acid, as pointed out in Kinjo and Nishikawa |45| . On the other hand, the second and the third principal 
eigenvalues are positive, so that the product of ith and jth elements of their eigenvectors represents how 
often the ith and the jth types of amino acids can be replaced to each other. Kinjo and Nishikawa [45] 
showed that the second principal eigenvector is well correlated with a hydrophobicity scale of amino acids. 

Thus, the sharp transition in the order of the eigenvalues contributing to the mutabilities of amino 
acids and to the replaceabilities of amino acid pairs at about 35 % identity means that the memory 
of ancestral sequences disappear and amino acids in the sequences are replaced with similar physico- 
chemical types of amino acids at about 35 % identity. This explains why it becomes hard to identify 
homologous relationships between sequences whose similarities are less than 35 % identity |45| . Barriers 
for identifying sequence homologies may also exist at about 25 % and 15 %, where the second and the 
third sharp transitions in the order of the eigenvalues occur. Because conservative substitutions in respect 
to physico-chemical properties of amino acids are required for proteins to fold into their native structures, 
the second barrier at about 25 % corresponds to a threshold for being able to detect structural homology 
between proteins. The similar characteristic features are observed in the mtREV and the cpREV matrices, 
too. Thus, the characteristic features becoming manifest after a long evolutionary history of proteins are 
completely encoded in the transition matrices based on the reversible Markov model. This fact supports 
in some extent the appropriateness of the present Markov model to describe the evolutionary process of 
codon substitutions. 



-40 



-40 



-40 



-30 



-20 



-10 



-40 



-30 



-20 



-10 



Log-Odds of WAG 



Log-Odds of WAG 



Figure SI. The ML-87 and the ML-91 models fitted to WAG. Each element 
log-0{{S){T,a))ab of the log-odds matrices of (A) the ML-87 and (B) the ML-91 models fitted to the 
1-PAM WAG matrix is plotted against the log-odds log-0(5'^^°(l PAM))^,, calculated from WAG. 
Plus, circle, and cross marks show the log-odds values for one-, two-, and three-step amino acid pairs, 
respectively. The dotted line in each figure shows the line of equal values between the ordinate and the 
abscissa. 
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Figure S2. Comparison between various estimates of selective constraint for each amino acid 
pair The ML estimates of selective constraint on substitutions of each amino acid pair are compared between 
tire models fitted to various empirical substitution matrices. The estimates Wab for multi-step amino acid pairs 
tliat belong to the least exchangeable class at least in one of the models are not shown. Plus, circle, and cross 
marks show the values for one-, two-, and three-step amino acid pairs, respectively. 







' ' ' ' I ' ' ' ' I ' ' ' ' I ' ' ' ' I ' ' ' ' I 
2 4 6 

Mean Energy Increment 
















2 4 6 

Mean Energy Increment 



8 



Figure S3. Selective constraint for each amino acid pair estimated from WAG and from 

LG. The ML estimate, _iyWAG-ML9i+ _^lg-ml91+ ^g^,^ selective constraint on 

substitutions of eacli amino acid pair in tiie ML-91+ models fitted to the 1-PAM matrices of WAG and 
LG is plotted against the mean energy increment due to an amino acid substitution, (AeJ^^ + Ae^j^) 
defined by Eqs. Sl-4, Sl-5, and Sl-6. The estimates Wat for the least exchangeable class of multi-step 
amino acid pairs are not shown. Plus, circle, and cross marks show the values for one-, two-, and 
three-step amino acid pairs, respectively. 
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Figure S4. Comparison of the ML estimates of selective constraint for each amino acid 
pair between the ML-87 and the ML-91 models. The ML estimate of selective constraint for 
each single step amino acid pair in the ML-87 model fitted to (A) the 1-PAM JTT matrix or (B) the 
1-PAM WAG matrix is plotted against that in the ML-91 model. 
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Figure S5. Models fitted to each of JTT, WAG, and LG. Each element log-0((S')(f , a))ab of the 
log-odds matrix of the model fitted to each empirical substitution matrix is plotted against the log-odds 
log-0(S°'"^(l PAM))a(, calculated from the corresponding empirical substitution matrix. Plus, circle, and cross 
marks show the log-odds values for one-, two-, and three-step amino acid pairs, respectively. The dotted line in 
each figure shows the line of equal values between the ordinate and the abscissa. 
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Figure S6. Models fitted to each of cpREV and mtREV. Each element log-0({S){f, a))ab of the 
log-odds matrix of the model fitted to each empirical substitution matrix is plotted against the log-odds 
log-0(S'°'^*'(l PAM))a6 calculated from the corresponding empirical substitution matrix. Plus, circle, and cross 
marks show the log-odds values for one-, two-, and three-step amino acid pairs, respectively. The dotted line in 
each figure shows the line of equal values between the ordinate and the abscissa. 
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Figure S7. Models fitted to the KHG-derived amino acid substitution matrix. Each 
element log-0((5)(f, (T))ab of the log-odds matrix of the model fitted to the 1-PAM KHG-derived amino 
aeid substitution matrix (KHGaa) is plotted against the log-odds log-0(S'°'^'^(l PAM))ab calculated 
from KHGaa. Plus, circle, and cross marks show the log-odds values for one-, two-, and three-step 
amino acid pairs, respectively. The dotted line in each figure shows the line of equal values between the 
ordinate and the abscissa. 
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Figure S8. The JTT-ML91+-12 model fitted to the 1-PAM KHG codon substitution 
matrix. Each element log-0((5')(f, ct))^^ of the log-odds matrix corresponding to (A) single, (B) 
double, and (C) triple nucleotide changes in the JTT-ML91-f-12 model fitted to the 1-PAM KHG 
codon substitution matrix is plotted against the log-odds log-0(S'^^'^(l PAM))^^ calculated from 
KHG. Upper triangle, plus, circle, and cross marks show the log-odds values for synonymous pairs and 
one-, two-, and three-step amino acid pairs, respectively. The dotted line in each figure shows the line of 
equal values between the ordinate and the abscissa. 
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Figure S9. The WAG-ML91+-12 model fitted to the 1-PAM KHG codon substitution 
matrix. Each clement log-0((5)(f, (t))^,^ of the log-odds matrix corresponding to (A) single, (B) 
double, and (C) triple nucleotide changes in the WAG-ML91+-12 model fitted to the 1-PAM KHG 
codon substitution matrix is plotted against the log-odds log-0(S'^^'^(l PAM))^,^ calculated from 
KHG. Upper triangle, plus, circle, and cross marks show the log-odds values for synonymous pairs and 
one-, two-, and three-step amino acid pairs, respectively. The dotted line in each figure shows the line of 
equal values between the ordinate and the abscissa. 
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Figure SIO. The LG-ML91+-12 model fitted to the 1-PAM KHG codon substitution 
matrix. Each element log-0((5)(f, 0-))^,^ of the log-odds matrix corresponding to (A) single, (B) 
double, and (C) triple nucleotide changes in the LG-ML91+-12 model fitted to the 1-PAM KHG codon 
substitution matrix is plotted against the log-odds log-0(S'^^'^(l PAM))^i, calculated from KHG. 
Upper triangle, plus, circle, and cross marks show the log-odds values for synonymous pairs and one-, 
two-, and three-step amino acid pairs, respectively. The dotted line in each figure shows the line of 
equal values between the ordinate and the abscissa. 
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Figure Sll. Temporal changes of the eigenvalues and the eigenvectors of the log-odds matrix 
log-0{{S){t)) calculated by the ML-91-|- model fitted to JTT as a function of sequence identity. In 

(A), the solid, the broken, and the dotted lines show the temporal changes of the first (Ai), the second (A2), and 
the third (A3) principal eigenvalues, respectively. The inner products of the eigenvectors with the eigenvectors of 



the JTT 20-PAM log-odds matrix, Vi{t) ■ Vj ^ (20-PAM), are shown in (B) for the first principal eigenvector 
(i = 1), in (C) for the second principal eigenvector (i = 2), and in (D) for the third principal eigenvector (i = 3), 
by sohd lines for j = 1, by broken lines for j = 2, and by dotted lines for j — 3. 



Table SI. ML estimates of the present models without selective constraints on amino acids for the 
1-PAM substitution matrices of JTT, WAG, cpREV, and mtREV. 
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(1 0) 
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0.708 


(1 0) 
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1.0^ 
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f mut 

Jt + a 


(0.5) 


0.495 
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rinut / rniut 
H 1 H + a 
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10 


rmut / rmut 

Jc iJc+g 
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#parameters 


21 
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Ik Lie) X 108 b 
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207260 
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233841 


1014962 


249448 
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305500 




AAIC 


86428.1 
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3478.0 
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2644.1 
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Ratio of substitution rates 




















per codon 




















the total base/codon 


1.0 


1.30 


1.0 


1.47 


1.0 


1.40 


1.0 


1.35 




transition / trans version 


1.13 


1.00 


0.848 
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1.11 


1.02 


1.24 
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" In all models, equal codon usage (j^"^^S° = /a^^^° = fc^^^'^ = /g'^^^'^ = 0.25) is assumed. If the value of a parameter is 
parenthesized, the parameter is not variable but fixed to the value specified. 

Ikl{0) = -(£(e)/Af -I- 2.98607330) for JTT, -(f(e)/Ar -I- 2.97444860) for WAG, -(f(0)/7V -f 2.95801048) for cpREV, and 
-ie{e)/N + 2.85313622) for mtREV; see text for details. 

" AAIC = 2NIkl{0) + 2x #parameters with N ~ 5919000 for JTT, N ^ 1637663 for WAG, N « 169269 for cpREV and 
N ?s 137637 for mtREV; see text for details. 
Note that these ratios arc not the ratios of the rates per site but per codon; sec text for details. 



Table S2. ML estimates of the present models with the selective constraints based on mean energy 
increments due to an amino acid substitution (EI) for the 1-PAM substitution matrices of JTT, WAG, 
cpREV, and mtREV. 
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per codon 


















the total base/codon 


1.36 


1.35 


1.53 


1.54 


1.45 


1.48 


1.38 


1.44 


transition /transversion 


1.09 


1.11 


0.803 


0.834 


1.08 


1.13 


1.34 


1.41 


nonsynonymous /synonymous'' 


2.09 


2.13 


2.48 


2.82 


2.45 


2.65 


1.75 


1.92 


Ratio of substitution rates per codon 


















for fT — > 


















total base/codon 


1,0 


1.18 


1.0 


1.38 


1.0 


1.31 


1.0 


1.37 


transition / transversion 


1.49 


1.28 


1.25 


0.944 


1.93 


1.36 


2.35 


1.56 


nonsynonymous /synonymous'^ 


1.12 


1.59 


0.945 


2.13 


1.15 


1.99 


0.767 


1.64 


Ratio of substitution rates per codon 


















for = and ct — > 


















total base/codon 


1.0 


1.28 


1.0 


1.59 


1.0 


1.48 


1.0 


1.59 


transition / transversion 


1.31 


1.15 


0.983 


0.830 


1.51 


1.50 


2.15 


1.57 


nonsynonymous / synonymous'' 


2.57 


3.83 


2.82 


6.53 


2.74 


1.16 


1.84 


4.51 



" In all models, equal codon usage ( y^^^^se _ f^^^^'^ = f^^^^^ = f^^^^'^ = 0.25 ) is assumed. If the value of a parameter 
is parenthesized, the parameter is not variable but fixed to the value specified. 

>> IklW = -(^(0)/Af + 2.98607330) for JTT, -(£(0)/Ar + 2.97444860) for WAG, -(^(0)/7V -I- 2.95801048) for cpREV, and 
-{e{0)/N + 2.85313622) for mtREV; see text for details. 

" AAIC = 2NiKL{9) + 2x #parameters with N ~ 5919000 for JTT, N RJ 1637663 for WAG, si 169269 for cpREV, and 
N ^ 137637 for mtREV; see text for details. 

'' Note that these ratios are not the ratios of the rates per site but per codon; see text for details. 



Table S3. ML estimates of the present models with the selective constraints based on the Grantham's 
and the Miyata's amino acid distances for the 1-PAM substitution matrices of JTT and WAG. 





JTT 


WAG 




Grantham- " 


Miyat 


a- " 


Grantham- " 


Miyata- " 




1 n 
iU 


il 


1 n 
iU 


1 1 
i i 


1 n 
iU 


1 1 
ii 


1 n 


1 1 
li 




— Wo 


(0.0) 


(0.0) 


(0.0) 


(0.0) 


(0.0) 


(0.0) 


(0.0) 


(0.0) 




82.0 


81.9 


1.71 


1.82 


58.9 


65.1 


1.28 


1.59 






n nQoo 
U.Uoyz 


(-> 0) 


0.617 


(-> U) 


U.oOo 


(-> 0) 


1.00 


^tc\ag/"l[tc][ag] 


2.12 


2.09 


2.32 


1.92 


1.49 


1.44 


1.64 


1.40 




1.08 


1.08 


1.05 


1.05 


1.18 


1.17 


1.15 


1.11 


™ta/m[tc][a9] 


0.864 


0.863 


0.925 


0.983 


0.987 


0.938 


1.02 


1.02 


"itg/rhit^^l^g-j 


0.961 


0.983 


0.922 


0.985 


0.816 


0.907 


0.813 


0.912 


^ca/m[tc\[ag] 


1.16 


1.16 


1.26 


1.12 


1.39 


1.32 


1.55 


1.23 


f mut 
Jt + a 

f mut / f mut 

Jt /Jt+a 


0.582 


0.581 


0.574 


0.543 


0.528 


0.517 


0.499 


0.466 


0.512 


0.513 


0.513 


0.505 


0.573 


0.562 


0.575 


0.531 


f mut / f mut 

•f c 1 ■} c-\-g 


0.384 


0.385 


0.448 


0.479 


0.412 


0.420 


0.513 


0.541 




2.80 


2.37 


2.98 


0.00938 


9.00 


2.97 


9.87 


0.00118 


TO 


0.0330 


0.0306 


0.0342 


0.0147 


0.0596 


0.0317 


0.0632 


0.0135 


^parameters 


30 


31 


30 


31 


30 


31 


30 


31 


X 108 


157835 


157281 


138419 


130721 


173694 


168463 


154639 


133347 


AAIC = 


18744.5 


18680.9 


16446.1 


15536.8 


5749.0 


5579.7 


5124.9 


4429.5 


Ratio of substitution rates per codon 


















the total base/the total codon 


1.35 


1.35 


1.35 


1.34 


1.51 


1.50 


1.51 


1.53 


transition / trans version 


1.04 


1.04 


1.07 


1.10 


0.768 


0.779 


0.791 


0.812 


nonsynonymous /synonymous'' 


2.21 


2.20 


2.14 


2.18 


2.54 


2.65 


2.53 


2.93 


Ratio of substitution rates per codon 


















for CT — 


















the total base/the total codon 


1.0 


1.02 


1.0 


1.33 


1.0 


1.16 


1.0 


1.53 


transition / trans version 


1.33 


1.31 


1.42 


1.10 


1.06 


0.951 


1.17 


0.813 


nonsynonymous /synonymous'' 


1.22 


1.28 


1.17 


2.17 


1.04 


1.52 


1.02 


2.93 


Ratio of substitution rates per codon 


















for Wab = and tr — ^ 


















the total base/the total codon 


1.0 


1.04 


1.0 


1.48 


1.0 


1.26 


1.0 


1.74 


transition / transversion 


1.12 


1.10 


1.21 


0.990 


0.803 


0.771 


0.881 


0.736 


nonsynonymous /synonymous'* 


2.67 


2.81 


2.63 


5.24 


2.97 


4.20 


2.92 


8.49 



" In all models, equal codon usage (/"^^^° = fa^^^" = /"^^^° = /"^^^° = 0.25) is assumed. If the value of a parameter is 
parenthesized, the parameter is not variable but fixed to the value specified. 

IklW = -{f-{0)/N + 2.98607330) for JTT, and -{l{e)/N + 2.97444860) for WAG; see text for details. 
'= AAIC = 2NiKL{0) + 2x #parameters with N = 5919000 for JTT, and N f» 1637663 for WAG; see text for details. 
'' Note that these ratios are not the ratios of the rates per site but per codon; see text for details. 



